Applied Reliability

Since the publication of the second edition of Applied Reliability in 1995, the ready availability of inexpensive, powerful statistical software has changed the way statisticians and engineers look at and analyze all kinds of data. Problems in reliability that were once difficult and time consuming even for experts can now be solved with a few well-chosen clicks of a mouse. However, software documentation has had difficulty keeping up with the enhanced functionality added to new releases, especially in specialized areas such as reliability analysis. Using analysis capabilities in spreadsheet software and two popular, well-maintained, supported, and frequently updated software packages—Minitab and SAS JMP—the third edition of Applied Reliability is an easy-to-use guide to basic descriptive statistics, reliability concepts, and the properties of lifetime distributions such as the exponential, Weibull, and lognormal. The material covers reliability data plotting, acceleration models, life test data analysis, systems models, and much more. The third edition includes a new chapter on Bayesian reliability analysis and expanded, updated coverage of repairable system modeling.

Applied Reliability
Third Edition

Paul A. Tobias
David C. Trindade


Taking a practical and example-oriented approach to reliability analysis, this book provides detailed illustrations of software implementation throughout and more than 150 worked-out examples done with JMP, Minitab, and several spreadsheet programs. In addition, there are nearly 300 figures, hundreds of exercises, and additional problems at the end of each chapter, and new material throughout.


CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742

© 2012 by Taylor & Francis Group, LLC
CRC Press is an imprint of Taylor & Francis Group, an Informa business

No claim to original U.S. Government works
Version Date: 20110804
International Standard Book Number-13: 978-1-4398-9724-9 (eBook - PDF)

This book contains information obtained from authentic and highly regarded sources. Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged, please write and let us know so we may rectify in any future reprint.

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.

For permission to photocopy or use material electronically from this work, please access www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of users. For organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged.

Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation without intent to infringe.

Visit the Taylor & Francis Web site at http://www.taylorandfrancis.com and the CRC Press Web site at http://www.crcpress.com

Contents

Preface
List of Figures
List of Tables
List of Examples

1. Basic Descriptive Statistics
   1.1 Populations and Samples
   1.2 Histograms and Frequency Functions
   1.3 Cumulative Frequency Function
   1.4 The Cumulative Distribution Function and the Probability Density Function
   1.5 Probability Concepts
   1.6 Random Variables
   1.7 Sample Estimates of Population Parameters
   1.8 How to Use Descriptive Statistics
   1.9 Data Simulation
   1.10 Summary
   Appendix 1A
      1.1A Creating a Step Chart in a Spreadsheet
   Problems

2. Reliability Concepts
   2.1 Reliability Function
   2.2 Some Important Probabilities
   2.3 Hazard Function or Failure Rate
   2.4 Cumulative Hazard Function
   2.5 Average Failure Rate
   2.6 Units
   2.7 Bathtub Curve for Failure Rates
   2.8 Recurrence and Renewal Rates
   2.9 Mean Time to Failure and Residual Lifetime
   2.10 Types of Data
      2.10.1 Exact Times: Right-Censored Type I
      2.10.2 Exact Times: Right-Censored Type II
      2.10.3 Readout Time or Interval Data
      2.10.4 Multicensored Data
      2.10.5 Left-Censored Data
      2.10.6 Truncated Data
   2.11 Failure Mode Separation
   2.12 Summary
   Problems

3. Exponential Distribution
   3.1 Exponential Distribution Basics
   3.2 The Mean Time to Fail for the Exponential
   3.3 The Exponential Lack of Memory Property
   3.4 Areas of Application for the Exponential
   3.5 Exponential Models with Duty Cycles and Failure on Demand
   3.6 Estimation of the Exponential Failure Rate λ
   3.7 Exponential Distribution Closure Property
   3.8 Testing Goodness of Fit: The Chi-Square Test
   3.9 Testing Goodness of Fit: Empirical Distribution Function Tests
      3.9.1 D-Statistics: Kolmogorov–Smirnov
      3.9.2 W²-Statistics: Cramer–von Mises
      3.9.3 A²-Statistics: Anderson–Darling
   3.10 Confidence Bounds for λ and the MTTF
   3.11 The Case of Zero Failures
   3.12 Planning Experiments Using the Exponential Distribution
   3.13 Simulating Exponential Random Variables
   3.14 The Two-Parameter Exponential Distribution
   3.15 Summary
   Appendix 3A
      3.1A Test Planning via Spreadsheet Functions
         Determining the Sample Size
         Determining the Test Length Using Spreadsheet Functions
         Determining the Number of Allowed Failures via Spreadsheet Functions
      3.2A EDF Goodness-of-Fit Tests Using Spreadsheets
         KS Test
   Problems

4. Weibull Distribution
   4.1 Empirical Derivation of the Weibull Distribution
      4.1.1 Weibull Spreadsheet Calculations
   4.2 Properties of the Weibull Distribution
   4.3 Extreme Value Distribution Relationship
   4.4 Areas of Application
   4.5 Weibull Parameter Estimation: Maximum Likelihood Estimation Method
   4.6 Weibull Parameter Estimation: Linear Rectification
   4.7 Simulating Weibull Random Variables
   4.8 The Three-Parameter Weibull Distribution
   4.9 Goodness of Fit for the Weibull
   4.10 Summary
   Appendix 4A
      4.1A Using a Spreadsheet to Obtain Weibull MLEs
      4.2A Using a Spreadsheet to Obtain Weibull MLEs for Truncated Data
      4.3A Spreadsheet Likelihood Profile Confidence Intervals for Weibull Parameters
   Problems

5. Normal and Lognormal Distributions
   5.1 Normal Distribution Basics
   5.2 Applications of the Normal Distribution
   5.3 Central Limit Theorem
   5.4 Normal Distribution Parameter Estimation
   5.5 Simulating Normal Random Variables
   5.6 Lognormal Life Distribution
   5.7 Properties of the Lognormal Distribution
   5.8 Lognormal Distribution Areas of Application
   5.9 Lognormal Parameter Estimation
   5.10 Some Useful Lognormal Equations
   5.11 Simulating Lognormal Random Variables
   5.12 Summary
   Appendix 5A
      5.1A Using a Spreadsheet to Obtain Lognormal MLEs
      5.2A Using a Spreadsheet to Obtain Lognormal MLEs for Interval Data
   Problems

6. Reliability Data Plotting
   6.1 Properties of Straight Lines
   6.2 Least Squares Fit (Regression Analysis)
   6.3 Rectification
   6.4 Probability Plotting for the Exponential Distribution
      6.4.1 Rectifying the Exponential Distribution
      6.4.2 Median Rank Estimates for Exact Failure Times
      6.4.3 Median Rank Plotting Positions
      6.4.4 Confidence Limits Based on Rank Estimates
      6.4.5 Readout (Grouped) Data
      6.4.6 Alternative Estimate of the Failure Rate and Mean Life
      6.4.7 Confidence Limits for Binomial Estimate for Readout Data
   6.5 Probability Plotting for the Weibull Distribution
      6.5.1 Weibull Plotting: Exact Failure Times
      6.5.2 Weibull Survival Analysis via JMP
      6.5.3 Weibull Survival Analysis via Minitab
   6.6 Probability Plotting for the Normal and Lognormal Distributions
      6.6.1 Normal Distribution
      6.6.2 Lognormal Distribution
   6.7 Simultaneous Confidence Bands
   6.8 Summary
   Appendix 6A
      6.1A Order Statistics and Median Ranks
   Problems

7. Analysis of Multicensored Data
   7.1 Multicensored Data
      7.1.1 Kaplan–Meier Product Limit Estimation
   7.2 Analysis of Interval (Readout) Data
      7.2.1 Interval (Readout) Data Analysis in JMP and Minitab
      7.2.2 Minitab Solution
      7.2.3 JMP Solution
   7.3 Life Table Data
   7.4 Left-Truncated and Right-Censored Data
   7.5 Left-Censored Data
   7.6 Other Sampling Schemes (Arbitrary Censoring: Double and Overlapping Interval Censoring)—Peto–Turnbull Estimator
      7.6.1 Current Status Data
   7.7 Simultaneous Confidence Bands for the Failure Distribution (or Survival) Function
      7.7.1 Hall–Wellner Confidence Bands
      7.7.2 Nair Equal Precision Confidence Bands
      7.7.3 Likelihood Ratio-Based Confidence Bands
      7.7.4 Bootstrap Methods for Confidence Bands
      7.7.5 Confidence Bands in Minitab and JMP
   7.8 Cumulative Hazard Estimation for Exact Failure Times
   7.9 Johnson Estimator
   Summary
   Appendix 7A
      7.1A Obtaining Bootstrap Confidence Bands Using a Spreadsheet
   Problems

8. Physical Acceleration Models
   8.1 Accelerated Testing Theory
   8.2 Exponential Distribution Acceleration
   8.3 Acceleration Factors for the Weibull Distribution
   8.4 Likelihood Ratio Tests of Models
   8.5 Confidence Intervals Using the LR Method
   8.6 Lognormal Distribution Acceleration
   8.7 Acceleration Models
   8.8 Arrhenius Model
   8.9 Estimating ΔH with More than Two Temperatures
   8.10 Eyring Model
   8.11 Other Acceleration Models
   8.12 Acceleration and Burn-In
   8.13 Life Test Experimental Design
   8.14 Summary
   Appendix 8A
      8.1A An Alternative JMP Input for Weibull Analysis of High-Stress Failure Data
      8.2A Using a Spreadsheet for Weibull Analysis of High-Stress Failure Data
      8.3A Using a Spreadsheet for MLE Confidence Bounds for Weibull Shape Parameter
      8.4A Using a Spreadsheet for Lognormal Analysis of the High-Stress Failure Data Shown in Table 8.5
      8.5A Using a Spreadsheet for MLE Confidence Bounds for the Lognormal Shape Parameter
      8.6A Using a Spreadsheet for Arrhenius–Weibull Model
      8.7A Using a Spreadsheet for MLEs for Arrhenius–Power Relationship Lognormal Model
      8.8A Spreadsheet Templates for Weibull or Lognormal MLE Analysis
   Problems

9. Alternative Reliability Models
   9.1 Step Stress Experiments
   9.2 Degradation Models
      9.2.1 Method 1
      9.2.2 Method 2
   9.3 Lifetime Regression Models
   9.4 The Proportional Hazards Model
      9.4.1 Proportional Hazards Model Assumption
      9.4.2 Properties and Applications of the Proportional Hazards Model
   9.5 Defect Subpopulation Models
   9.6 Summary
   Appendix 9A
      9.1A JMP Solution for Step Stress Data in Example 9.1
      9.2A Lifetime Regression Solution Using Excel
      9.3A JMP Likelihood Formula for the Defect Model
      9.4A JMP Likelihood Formulas for Example 9.7 Multistress Defect Model Example
   Problems

10. System Failure Modeling: Bottom-Up Approach
   10.1 Series System Models
   10.2 The Competing Risk Model (Independent Case)
   10.3 Parallel or Redundant System Models
   10.4 Standby Models and the Gamma Distribution
   10.5 Complex Systems
   10.6 System Modeling: Minimal Paths and Minimal Cuts
   10.7 General Reliability Algorithms
   10.8 Burn-In Models
   10.9 The “Black Box” Approach: An Alternative to Bottom-Up Methods
   10.10 Summary
   Problems

11. Quality Control in Reliability: Applications of Discrete Distributions
   11.1 Sampling Plan Distributions
      11.1.1 Permutations and Combinations
      11.1.2 Permutations and Combinations via Spreadsheet Functions
      11.1.3 The Binomial Distribution
      11.1.4 Cumulative Binomial Distribution
      11.1.5 Spreadsheet Function for the Binomial Distribution
      11.1.6 Relation of Binomial Distribution to Beta Distribution
   11.2 Nonparametric Estimates Used with the Binomial Distribution
   11.3 Confidence Limits for the Binomial Distribution
   11.4 Normal Approximation for Binomial Distribution
   11.5 Confidence Intervals Based on Binomial Hypothesis Tests
   11.6 Simulating Binomial Random Variables
   11.7 Geometric Distribution
   11.8 Negative Binomial Distribution
   11.9 Hypergeometric Distribution and Fisher’s Exact Test
      11.9.1 Hypergeometric Distribution
      11.9.2 Fisher’s Exact Test
      11.9.3 Fisher’s Exact Test in JMP and Minitab
   11.10 Poisson Distribution
   11.11 Types of Sampling
      11.11.1 Risks
      11.11.2 Operating Characteristic Curve
      11.11.3 Binomial Calculations
      11.11.4 Examples of Operating Characteristic Curves
   11.12 Generating a Sampling Plan
      11.12.1 LTPD Sampling Plans
   11.13 Minimum Sample Size Plans
   11.14 Nearly Minimum Sampling Plans
   11.15 Relating an OC Curve to Lot Failure Rates
   11.16 Statistical Process Control Charting for Reliability
   11.17 Summary
   Problems

12. Repairable Systems Part I: Nonparametric Analysis and Renewal Processes
   12.1 Repairable versus Nonrepairable Systems
   12.2 Graphical Analysis of a Renewal Process
   12.3 Analysis of a Sample of Repairable Systems
      12.3.1 Solution Using Spreadsheet Methods
   12.4 Confidence Limits for the Mean Cumulative Function (Exact Age Data)
      12.4.1 True Confidence Limits
   12.5 Nonparametric Comparison of Two MCF Curves
   12.6 Renewal Processes
   12.7 Homogeneous Poisson Process
      12.7.1 Distribution of Repair Times for HPP
   12.8 MTBF and MTTF for a Renewal Process
   12.9 MTTF and MTBF Two-Sample Comparisons
   12.10 Availability
   12.11 Renewal Rates
   12.12 Simulation of Renewal Processes
   12.13 Superposition of Renewal Processes
   12.14 CDF Estimation from Renewal Data (Unidentified Replacement)
   12.15 Summary
   Appendix 12A
      12.1A True Confidence Limits for the MCF
      12.2A Cox F-Test for Comparing Two Exponential Means
      12.3A Alternative Approach for Estimating CDF Using the Fundamental Renewal Equation
   Problems

13. Repairable Systems Part II: Nonrenewal Processes
   13.1 Graphical Analysis of Nonrenewal Processes
   13.2 Two Models for a Nonrenewal Process
   13.3 Testing for Trends and Randomness
      13.3.1 Other Graphical Tools
   13.4 Laplace Test for Trend
   13.5 Reverse Arrangement Test
   13.6 Combining Data from Several Tests
   13.7 Nonhomogeneous Poisson Processes
   13.8 Models for the Intensity Function of an NHPP
      13.8.1 Power Relation Model
      13.8.2 Exponential Model
   13.9 Rate of Occurrence of Failures
   13.10 Reliability Growth Models
   13.11 Simulation of Stochastic Processes
   13.12 Summary
   Problems

14. Bayesian Reliability Evaluation
   14.1 Classical versus Bayesian Analysis
      14.1.1 Bayes’ Formula, Prior and Posterior Distribution Models, and Conjugate Priors
      14.1.2 Bayes’ Approach for Analysis of Exponential Lifetimes
   14.2 Classical versus Bayes System Reliability
      14.2.1 Classical Paradigm for HPP System Reliability Evaluation
      14.2.2 Bayesian Paradigm for HPP System Reliability Evaluation
      14.2.3 Advantages and Disadvantages of Using Bayes’ Methodology
   14.3 Bayesian System MTBF Evaluations
      14.3.1 Calculating Prior Parameters Using the 50/95 Method
      14.3.2 Calculating the Test Time Needed to Confirm an MTBF Objective
   14.4 Bayesian Estimation of the Binomial p
   14.5 The Normal/Normal Conjugate Prior
   14.6 Informative and Noninformative Priors
   14.7 A Survey of More Advanced Bayesian Methods
   14.8 Summary
   Appendix 14A
      14.1A Gamma and Chi-Square Distribution Relationships
   Problems

Answers to Selected Exercises
References
Index


Preface

It has been more than 15 years since the publication of the second edition of Applied Reliability. We continue to receive positive feedback from old users, and each year, hundreds of engineers, quality specialists, and statisticians discover the book for the first time and become new fans. So, why a third edition? There are always new methods and techniques that update and improve upon older methods, but that was not the primary reason we felt the need to write a new edition.

In the past 15 years, the ready availability of relatively inexpensive, powerful statistical software has changed the way statisticians and engineers look at and analyze all kinds of data. Problems in reliability that were once difficult and time consuming even for experts can now be solved with a few well-chosen clicks of a mouse. Additionally, with the quantitative solution often comes a plethora of graphics that aid in understanding and presenting the results. All this power comes with a price, however. Software documentation has had difficulty keeping up with the enhanced functionality added to new releases, especially in specialized areas such as reliability analysis. Also, in some cases, different well-known software packages use different methods and output different answers. An analyst needs to know how to use these programs effectively and which methods are the most highly recommended. This information is hard to find for industrial reliability problems.

The third edition of Applied Reliability was written to fulfill this software documentation need for reliability analysts. We chose two popular software packages that are well maintained, supported, and frequently updated: Minitab and SAS JMP. Minitab is popular in universities, and JMP is widely used within leading high-technology companies. Both packages have extensive capabilities for reliability analysis and graphics that improve with every new release. In addition, we included solutions using spreadsheet programs such as Microsoft Excel and Oracle OpenOffice Calc. With a little formula programming, spreadsheet functions can solve even very difficult reliability problems. Spreadsheet methods cannot easily produce custom, specialized reliability graphics, however, and are included primarily because they are so widely available and surprisingly powerful.

Unfortunately, producing detailed examples using software has many pitfalls. We would generate graphics of screenshots and describe how to obtain specific platforms and run analyses, only to have a new release of either JMP or Minitab come out that looked and operated somewhat differently. Even spreadsheet mechanics change with new releases. We frequently had to go back and redo problem solutions to remain current with updates. Finally, we realized that our readers would inevitably see panels and screens coming from later releases of these software packages that might differ slightly from the screenshots shown in our text. However, it is likely that the basic methods and approaches will remain the same for a long time. Many of the suggestions we made to software developers based on methods described in the second edition are now a part of these packages or will be in future releases. Two examples are the very useful defect model (incorporated in JMP release 9) and the ability to input negative frequencies when analyzing truncated data (already in JMP 8).

We stated in the preface to the second edition: “Our goal remains that the text be application oriented, with numerous practical examples and graphical illustrations.” Statements of theory and useful equations are essential building blocks, but what the industrial reliability analyst needs to know is how to apply these building blocks to numerically solve typical problems. The new edition has more than 150 worked-out examples, many done with both JMP and Minitab and even spreadsheet programs. Along with these examples, there are nearly 300 figures and hundreds of exercises and additional problems at the end of each chapter. We also took the opportunity to add new material throughout. Sometimes this new material increased the level of difficulty, and we chose to put it in appendices at the end of several chapters.

Since many of the examples, exercises, and problems use lengthy spreadsheets or worksheets of failure data, we have placed many of these files on the publisher’s website for the book. These data sets, in Excel, JMP, or Minitab format, can be accessed via the “Downloads & Updates” tab on the book’s web page at http://www.crcpress.com/product/isbn/9781584884668. Data sets are organized by book chapter and given a name either mentioned in the text or based on the number of the example, exercise, or problem to which they relate. There is also a directory containing Excel templates that can be used to find maximum likelihood solutions for Weibull and lognormal multistress, life test, or field data. There are even templates incorporating the defect model or for testing equal slopes or equal parameters across several cells of data. Another powerful software package not used in the text deserves mention: SPLUS, with the addition of Bill Meeker’s SPLIDA (SPLUS Life Data Analysis) downloadable front end, offers graphics and analysis capabilities that can also be used successfully on many of the data sets in the third edition.

Finally, we gratefully acknowledge the comments and suggestions made by our colleagues who provided feedback on sections of the second edition and/or reviewed draft copies of many prepublication chapters of the third edition. In particular, we appreciate the comprehensive suggestions and critiques offered by Wayne Nelson, Doug Montgomery, Judy Koslov, Bill Heavlin, Ed Russell, Ken Stephens, Leon Lopez, and the many users of the text.
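The Excel templates mentioned above find maximum likelihood estimates by numerically optimizing a censored-data likelihood. Purely as an illustration of that kind of calculation—a minimal sketch with made-up failure times, written in Python rather than a spreadsheet, and not one of the book’s templates—the following code fits a Weibull model to a small life test with right-censored units:

```python
# Minimal sketch (not the book's Excel template): Weibull maximum likelihood
# estimation for a life test with right-censored units, using hypothetical data.
import numpy as np
from scipy.optimize import minimize

fail_times = np.array([105.0, 260.0, 410.0, 550.0, 790.0])  # exact failure times (hours)
cens_times = np.array([1000.0] * 5)                         # five unfailed units removed at 1000 hours

def neg_log_likelihood(params):
    # Failures contribute log f(t); censored units contribute log S(t) = -(t/alpha)**beta.
    log_alpha, log_beta = params            # optimize on the log scale so alpha and beta stay positive
    alpha, beta = np.exp(log_alpha), np.exp(log_beta)
    ll_fail = np.sum(np.log(beta / alpha) + (beta - 1.0) * np.log(fail_times / alpha)
                     - (fail_times / alpha) ** beta)
    ll_cens = -np.sum((cens_times / alpha) ** beta)
    return -(ll_fail + ll_cens)

fit = minimize(neg_log_likelihood, x0=[np.log(800.0), np.log(1.0)], method="Nelder-Mead")
alpha_hat, beta_hat = np.exp(fit.x)
print(f"Weibull MLEs: characteristic life = {alpha_hat:.0f} hours, shape = {beta_hat:.2f}")
```

In the text itself, JMP, Minitab, or the spreadsheet templates (via Solver) carry out this kind of fit; the point of the sketch is only that the censored-data log likelihood reduces to a few lines of arithmetic, which is why spreadsheet formula programming can handle it.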

List of Figures Figure 1.1

Histogram of fuse breakdown measurements.................................................. 4

Figure 1.2

Plot of PDF cumulative frequency function...................................................... 5

Figure 1.3

Plot of PDF for the exponential distribution.....................................................6

Figure 1.4

CDF for exponential distribution........................................................................ 7

Figure 1.5

EDF for fuse data................................................................................................... 8

Figure 1.6

Minitab histogram of fuse data........................................................................... 9

Figure 1.7

The uniform PDF................................................................................................. 20

Figure 1.8

The CDF for the uniform distribution.............................................................. 20

Figure 1.9

Beta density functions........................................................................................22

Figure 1.10

Mean and Sigma do not tell us enough. These four distributions have the same mean and standard deviation.................................................. 23

Figure 1.11

Simulating ordered random variables.............................................................. 25

Figure 1.1A

Spreadsheet table for experiment...................................................................... 26

Figure 1.2A

Derived spreadsheet table for step chart.......................................................... 26

Figure 1.3A

Step chart.............................................................................................................. 27

Figure 2.1

Cumulative distribution function.....................................................................30

Figure 2.2

Bathtub curve for failure rates........................................................................... 36

Figure 2.3

Example of component failure data.................................................................. 37

Figure 2.4

Readout data.........................................................................................................42

Figure 3.1

The exponential distribution failure rate h(t).................................................. 48

Figure 3.2

Histogram of memory chip failure data.......................................................... 51

Figure 3.3

Piecewise approximation of actual failure rate...............................................54

Figure 3.4

Memory chip data histogram compared to f(t) shape.................................... 58

Figure 3.5

Illustration of Dn+ and Dn− statistics for KS test.................................................63

Figure 3.6

JMP histogram of test data.................................................................................65

Figure 3.7

Empirical distribution function plot and KSL D-statistics............................65

Figure 3.8

Minitab exponential analysis of failure times................................................. 66

Figure 3.9

JMP exponential analysis of failure times....................................................... 66

Figure 3.10

Spreadsheet columns for evaluating the product r × kr;1−α. ........................... 74

xv

xvi

List of Figures

Figure 3.1A

Spreadsheet entries to determine sample size................................................ 79

Figure 3.2A

Spreadsheet entries to determine test length..................................................80

Figure 3.3A

Spreadsheet entries to determine number of failures allowed..................... 81

Figure 3.4A

Spreadsheet entries for KS goodness-of-fit test............................................... 82

Figure 3.5A

 mpirical distribution function and exponential CDF model E (mean time to fail = 100)......................................................................................83

Figure 4.1

Weibull CDF......................................................................................................... 91

Figure 4.2

Weibull PDF.......................................................................................................... 91

Figure 4.3

Weibull failure rate (hazard rate)...................................................................... 92

Figure 4.4

JMP data table for exact times, censored data analysis............................... 103

Figure 4.5

Inputs for JMP Fit Parametric Survival analysis—exact times................... 104

Figure 4.6

JMP Weibull MLEs for Exercise 4.5 exact times, censored data................. 104

Figure 4.7

JMP data table for interval data from Exercise 4.5....................................... 105

Figure 4.8

Inputs for JMP Fit Parametric Survival analysis—interval data................ 105

Figure 4.9

JMP Weibull MLEs for Exercise 4.5 interval data......................................... 106

Figure 4.10

JMP data table for exact times treated as interval data................................ 106

Figure 4.11

Minitab analysis for exact times, right-censored data analysis.................. 107

Figure 4.12

Minitab analysis inputs for interval data....................................................... 108

Figure 4.1A

Genweibest spreadsheet with interval data from Example 4.5.................. 114

Figure 4.2A Solver entries for MLE analysis of the filled-in Genweibest spreadsheet................................................................................... 115 Figure 4.3A

Genweibest spreadsheet after solver finds MLEs......................................... 116

Figure 4.4A

Genweibest solution for MLEs based on truncated data............................. 117

Figure 4.5A

First iteration—the starting point is the MLE solution spreadsheet.......... 118

Figure 4.6A

Solver run (first iteration)................................................................................. 118

Figure 4.7A

Second iteration run of Goal Seek................................................................... 119

Figure 4.8A

Second iteration run of Solver.......................................................................... 119

Figure 4.9A

Start of fourth iteration confirming convergence when there is no change................................................................................... 120

Figure 5.1

The normal distribution PDF........................................................................... 124

Figure 5.2

The normal distribution CDF.......................................................................... 124

Figure 5.3

Plot of data from Table 5.2................................................................................ 133

Figure 5.4

Relationship of lognormal distribution to normal distribution................. 136

List of Figures

xvii

Figure 5.5

The lognormal distribution PDF..................................................................... 137

Figure 5.6

The lognormal distribution CDF..................................................................... 138

Figure 5.7

The lognormal distribution failure rate......................................................... 139

Figure 5.8

Minitab inputs and output for Table 5.5 data................................................ 143

Figure 5.9

JMP inputs and output for Table 5.5 data....................................................... 144

Figure 5.10

Minitab input and output screens for lognormal interval data................. 145

Figure 5.11

JMP input and output screens for interval data............................................ 146

Figure 5.1A

Genlnest.xls after converging to MLEs for Table 5.5 data........................... 150

Figure 5.2A

Excel solution for interval data to obtain lognormal MLEs����������������������� 151

Figure 6.1

Straight line plot................................................................................................ 154

Figure 6.2

Regression line example................................................................................... 155

Figure 6.3

JMP regression example................................................................................... 158

Figure 6.4

Minitab regression example............................................................................. 159

Figure 6.5

Ideal gas law plot............................................................................................... 160

Figure 6.6

Ideal gas law plot using rectification.............................................................. 160

Figure 6.7

 xponential probability plot of Table 6.2 data, exact times, E median ranks..................................................................................................... 166

Figure 6.8

Minitab probability plot of Table 6.2 data...................................................... 166

Figure 6.9

Exponential model fit to data, exact times, LS MTTF estimate.................. 168

Figure 6.10

 xponential probability plot, exact times, 90% confidence E limits on transformed CDF.............................................................................. 169

Figure 6.11

 xponential probability plot, exact times, 90% approximate E confidence limits on failure time quantiles................................................... 170

Figure 6.12

Exponential probability plot, readout data.................................................... 172

Figure 6.13

 xponential probability plot of readout data with approximate E 90% pointwise confidence limits on time t.................................................... 174

Figure 6.14

 xponential CDF plot of readout data with approximate 90% E pointwise confidence limits on time t............................................................ 174

Figure 6.15

Weibull probability plot, exact times.............................................................. 177

Figure 6.16

JMP output Weibull model analysis, exact times......................................... 179

Figure 6.17

Minitab output Weibull model analysis, exact times................................... 180

Figure 6.18

Lognormal probability plot, exact times, n = 600.......................................... 183

Figure 6.19

Extrapolation to T50 in lognormal probability plot....................................... 184

Figure 6.20

EDF plot with 90% confidence level band...................................................... 186

xviii

List of Figures

Figure 7.1

Eight units on stress: six failures and two censored (units 2 and 4)..................................................................................................... 194

Figure 7.2

Nonparametric survival and CDF curves...................................................... 195

Figure 7.3

CDF and two-sided 95% confidence limits.................................................... 198

Figure 7.4

JMP dialog box for exact data example.......................................................... 199

Figure 7.5

JMP output for exact data example................................................................. 200

Figure 7.6

JMP save estimates table for exact data example.......................................... 200

Figure 7.7

Minitab dialog box for exact data example.................................................... 201

Figure 7.8

Minitab summary output for exact data example........................................ 202

Figure 7.9

Minitab graph for exact data example............................................................ 203

Figure 7.10

Minitab spreadsheet for readout example, censoring at beginning of interval........................................................................................ 206

Figure 7.11

Minitab dialog boxes for readout data example........................................... 206

Figure 7.12

Minitab output for readout example, actuarial estimate............................. 207

Figure 7.13

JMP dialog box for readout example.............................................................. 208

Figure 7.14

JMP output for readout example, assuming censoring at beginning and end of interval......................................................................... 208

Figure 7.15

Minitab worksheet (partial) for Table 7.1 data............................................... 211

Figure 7.16

Minitab output (partial) actuarial table and failure plot............................. 212

Figure 7.17

Number at risk................................................................................................... 216

Figure 7.18

Cumulative failure distribution....................................................................... 216

Figure 7.19

Plot of CDF estimate versus time for left-censored data............................. 219

Figure 7.20

JMP data table and output for left-censored data in Table 7.14................... 219

Figure 7.21

Minitab output and graph for left-censored data in Table 7.14................... 220

Figure 7.22

Disk drive data CDF plot..................................................................................222

Figure 7.23

Current status data table and analysis output in JMP.................................222

Figure 7.24

Current status data table and analysis output in Minitab...........................223

Figure 7.25

Spreadsheet showing the calculations for determining the Hall–Wellner confidence bands for the first 18 observations in Table 7.20......................................................................................................... 227

Figure 7.26

Kaplan–Meier F(t) estimate and Hall–Wellner 90% confidence bands: linear (H-W), log (H-W LT), and logit (H-W LG) transformations.................................................................................................. 228

Figure 7.27

JMP output showing Nair 95% EP confidence bands.................................. 231

Figure 7.1A

Partial table for capturing “max” of bootstrap runs.................................... 237


Figure 7.2A

Dialog box for creating a data table................................................................ 237

Figure 7.3A

One-way data table with varying max values............................................... 237

Figure 7.4A

Bootstrap 90% confidence bands..................................................................... 238

Figure 7.5A

CDF estimate and 90% bootstrap confidence bands.................................... 238

Figure 8.1

Worksheet for the data in Table 8.2................................................................. 246

Figure 8.2

Weibull probability plot in worksheet............................................................ 247

Figure 8.3

Worksheet using indicator variables.............................................................. 248

Figure 8.4

Portion of data analysis summary output..................................................... 249

Figure 8.5

Minitab output for LS analysis of the data in Table 8.2............................... 249

Figure 8.6

Minitab Weibull plot of the data in Table 8.2................................................. 250

Figure 8.7

Partial Minitab worksheet for the data in Table 8.2...................................... 250

Figure 8.8

Minitab Weibull plot of the data in Table 8.2: equal slopes......................... 251

Figure 8.9

JMP-7 data table for analysis of the data in Table 8.2................................... 252

Figure 8.10

JMP-7 reliability/survival screen inputs........................................................ 252

Figure 8.11

JMP–Weibull plot and MLEs for the data cells of Table 8.2......................... 253

Figure 8.12

JMP data table for common-slope analysis of the data in Table 8.2 using indicator variables.................................................................. 254

Figure 8.13

JMP fit parametric survival model screen inputs for common-slope analysis of the data in Table 8.2............................................254

Figure 8.14

JMP fit parametric survival model screen outputs for common-slope analysis of the data in Table 8.2............................................ 255

Figure 8.15

Region for finding lognormal parameter confidence limits....................... 258

Figure 8.16

JMP output for Weibull parameter likelihood confidence limits for data in Table 8.2................................................................................ 259

Figure 8.17

Lognormal plot—cumulative percent failure data from Table 8.5..................................................................................................... 262

Figure 8.18

Lognormal plot (common slope)..................................................................... 262

Figure 8.19

JMP analysis for calculating same-slope MLEs............................................ 264

Figure 8.20

JMP analysis results with same slope............................................................. 265

Figure 8.21

Arrhenius plot using LS estimates.................................................................. 270

Figure 8.22

Minitab inputs for Arrhenius–Weibull fit...................................................... 270

Figure 8.23

Minitab Arrhenius–Weibull analysis output................................................ 271

Figure 8.24

JMP Arrhenius–Weibull analysis entry screens........................................... 272

Figure 8.25

JMP Arrhenius–Weibull analysis results....................................................... 272


Figure 8.26

JMP worksheet for Arrhenius–power relationship–lognormal model................................................................................................................. 275

Figure 8.27

JMP inputs for Arrhenius–power relationship–lognormal model................................................................................................................. 276

Figure 8.28

JMP results for Arrhenius–power relationship–lognormal model fit.............................................................................................................. 276

Figure 8.29

JMP dialog to estimate survival probabilities for Arrhenius–power relationship–lognormal model fit.................................................................... 277

Figure 8.30

JMP survival analysis at possible use conditions......................................... 277

Figure 8.31

Minitab inputs for Arrhenius power law analysis....................................... 278

Figure 8.32

Minitab plots for Arrhenius power law analysis.......................................... 278

Figure 8.33

Minitab results of accelerated life-test analysis............................................ 279

Figure 8.34

Minitab results of accelerated life-test predictions....................................... 279

Figure 8.35

Minitab input box for accelerated life-test prediction.................................. 280

Figure 8.1A

JMP inputs for analysis of the data in Table 8.2............................................ 286

Figure 8.2A  Analysis results for the data in Table 8.2 (MLEs assuming equal slopes)....................................................................................................... 286
Figure 8.3A

Excel spreadsheet for calculating MLEs of individual cells........................ 287

Figure 8.4A

Excel spreadsheet for calculating MLEs assuming a common shape....... 288

Figure 8.5A

Using Goal Seek confidence bound calculations.......................................... 289

Figure 8.6A

Solver example for confidence limit calculations......................................... 290

Figure 8.7A

Excel spreadsheet for calculating MLEs of individual cells........................ 291

Figure 8.8A

Excel spreadsheet for calculating MLEs of same-sigma cells..................... 292

Figure 8.9A

Spreadsheet confidence bound calculation for common sigma................. 292

Figure 8.10A  Solver screen for confidence bound calculation........................................... 293
Figure 8.11A  Excel spreadsheet fit for Arrhenius–Weibull model.................................... 293
Figure 8.12A  Spreadsheet calculation of Arrhenius–power relationship model parameter estimates.............................................................................. 294
Figure 8.13A  Spreadsheet for calculating use CDFs and confidence bounds.................. 295
Figure 8.14A  Goal Seek and Solver inputs for calculating profile likelihood limits................................................................................................. 296
Figure 9.1

Arrhenius step stress data schematic............................................................. 302

Figure 9.2

Plot of step stress data for ΔH = 0.5, 0.86, and 1.0.......................................... 303

Figure 9.3

Spreadsheet for calculating step stress interval widths under Arrhenius acceleration..................................................................................... 304

Figure 9.4

Spreadsheet for calculating step stress Arrhenius lognormal MLEs........305


Figure 9.5

Spreadsheet for calculating step stress Arrhenius/power law Weibull MLEs..................................................................................................... 307

Figure 9.6

Degradation data extrapolation to failure times...........................................309

Figure 9.7

Projected degradation failure times, 105°C.................................................... 312

Figure 9.8

Projected degradation failure times, 125°C................................................... 312

Figure 9.9

JMP spreadsheet for the plant/process field reliability data....................... 315

Figure 9.10

Fit parametric survival screen for plant/process reliability data............... 316

Figure 9.11

JMP analysis results for plant/process reliability data................................ 317

Figure 9.12

Minitab spreadsheet for the plant/process field reliability data................ 318

Figure 9.13

Minitab regression with life data screen for plant/process data................ 318

Figure 9.14

Minitab regression with life data output for plant/process data............... 319

Figure 9.15

Lognormal probability plot of 15 out of 100.................................................. 323

Figure 9.16

Lognormal probability plot of 15 out of 18.................................................... 323

Figure 9.17

JMP data table for defect model analysis of the Example 9.5 data............. 325

Figure 9.18

JMP nonlinear analysis entry screen.............................................................. 326

Figure 9.19

JMP nonlinear analysis platform control screen........................................... 326

Figure 9.20

MLEs for Example 9.5 defect model data....................................................... 327

Figure 9.21

Before and after panels for P = 1 for the nonlinear analysis....................... 328

Figure 9.22

Excel spreadsheet for MLE fitting of lognormal defect model data........... 329

Figure 9.23

Defect model, multistress cell data................................................................. 330

Figure 9.24

JMP MLEs for one cell Weibull defect analysis............................................. 332

Figure 9.25

Excel MLE fitting of defect model data inputted as truncated data..........334

Figure 9.26

JMP table showing defect model data inputted as truncated data....................................................................................................334

Figure 9.27

JMP fitting of defect model data inputted as truncated data...................... 335

Figure 9.1A

JMP data table for analysis of Example 9.2 Arrhenius step stress data..... 336

Figure 9.2A

Formula for Weibull, one cell, defect model.................................................. 337

Figure 9.3A

Formula for Weibull, three cells, defect model............................................. 338

Figure 9.4A

Formula for Weibull, effective delta temperature acceleration, defect model................................................................................ 338

Figure 9.5A

EXCEL spreadsheet for the plant process field reliability data ................. 339

Figure 9.6A

Solver screen for the plant/process field reliability data.............................340

Figure 9.7A

JMP negative log-likelihood column formula for the lognormal defect model....................................................................................340


Figure 9.8A

Formula for Weibull, one cell, defect model.................................................. 341

Figure 9.9A

Formula for Weibull, three cells, defect model............................................. 341

Figure 9.10A  Formula for Weibull, effective delta temperature acceleration, defect model................................................................................ 342
Figure 10.1

Five-component system diagram....................................................................354

Figure 10.2

Reduced five-component system diagram..................................................... 354

Figure 10.3

Fully reduced five-component system diagram...........................................354

Figure 10.4

Six-component system diagrams..................................................................... 355

Figure 10.5

Backup components.......................................................................................... 355

Figure 10.6

Backup components with switch.................................................................... 355

Figure 10.7

Equivalent diagram of system with working switch................................... 356

Figure 10.8

Bridge structure system diagram.................................................................... 357

Figure 10.9

Equivalent to bridge structure system diagram........................................... 358

Figure 10.10  Minimal cut analysis of bridge structure diagram...................................... 359
Figure 10.11  Example 10.10 system diagram........................................................................ 359
Figure 10.12  General reliability algorithm failure rate example....................................... 361
Figure 10.13  Failure rate before and after burn-in.............................................................. 364
Figure 11.1

Binomial distribution........................................................................................ 374

Figure 11.2

CDF for binomial distribution......................................................................... 375

Figure 11.3

Binomial data analysis in JMP......................................................................... 381

Figure 11.4

JMP binomial confidence interval calculation.............................................. 381

Figure 11.5

Binomial CDF n = 4, p = 0.5.............................................................................. 383

Figure 11.6

Input for Fisher’s exact in Minitab.................................................................. 390

Figure 11.7

Output for Fisher’s exact in Minitab............................................................... 390

Figure 11.8

Operating characteristic curve........................................................................ 397

Figure 11.9

Operating characteristic curves for different acceptance numbers.......................................................................................... 397

Figure 11.10  Operating characteristic curves for different sample sizes......................... 398
Figure 11.11  AOQ curve with AOQL.................................................................................... 399
Figure 11.12  Spreadsheet set-up for determining acceptance sampling plan................. 401
Figure 11.13  LTPD versus sample size for different acceptance values........................... 404
Figure 11.14  Three-sigma control chart for binomial proportions................................... 411
Figure 11.15  Cumulative count control................................................................................. 412
Figure 12.1

Dot plot of repair pattern.................................................................................. 419


Figure 12.2

Cumulative plot................................................................................................. 420

Figure 12.3

Interarrival times versus system age.............................................................. 421

Figure 12.4

Recurrence rate versus system age................................................................. 421

Figure 12.5

Lognormal probability plot..............................................................................422

Figure 12.6

CDF model fit versus observed........................................................................423

Figure 12.7

Event plot of repair histories for five systems...............................................425

Figure 12.8

Repair history (cumulative plots) for five systems....................................... 426

Figure 12.9

Repair history for two systems........................................................................ 426

Figure 12.10  Repair history for five systems........................................................................ 427
Figure 12.11  Mean cumulative repair function.................................................................... 428
Figure 12.12  Spreadsheet method for estimating the MCF................................................ 429
Figure 12.13  Spreadsheet method for estimating the MCF and naive confidence limits................................................................................................ 431
Figure 12.14  MCF and 95% naive confidence limits............................................................ 431
Figure 12.15  JMP data table for recurrence analysis........................................................... 432
Figure 12.16  JMP dialog box for recurrence analysis.......................................................... 432
Figure 12.17  JMP output for recurrence analysis................................................................. 433
Figure 12.18  Minitab data worksheet for repairable system analysis.............................. 434
Figure 12.19  Minitab dialog boxes for repairable system analysis................................... 434
Figure 12.20  Minitab output for analysis of five repairable systems................................ 435
Figure 12.21  MCF comparison between East and West Coast locations.......................... 437
Figure 12.22  MCFs for East and West Coast locations........................................................ 437
Figure 12.23  MCF differences between East and West Coast locations........................... 438
Figure 12.24  JMP plot of MCF difference between East and West Coast locations........ 439
Figure 12.25  Partial spreadsheet for time to kth repair....................................................... 443
Figure 12.26  Gamma distribution CDF for time to kth repair, MTBF = 1000 hours....... 444
Figure 12.27  Spreadsheet example for spare parts determination.................................... 445
Figure 12.28  Alternating renewal process............................................................................. 453
Figure 12.29  Markov two-state model................................................................................... 454
Figure 12.30  Partial spreadsheet for 10 HPP systems with MTBF = 1000........................ 456
Figure 12.31  Cumulative plots of 10 simulated HPP systems with MTBF = 1000 (censored at 10,000 hours)................................................................................. 457


Figure 12.32  Superposition of renewal processes for system of three components....... 458
Figure 12.33  System of c components viewed as a superposition of renewal processes.............................................................................................. 459
Figure 12.1A  Spreadsheet setup for variance estimates...................................................... 463
Figure 12.2A  Calculations for variance estimates................................................................ 464
Figure 12.3A  MCF variance, standard error, and confidence limits.................................. 465
Figure 12.4A  Possible outcomes for time differences in renewal estimation................... 467
Figure 13.1

Dot plot of repair pattern.................................................................................. 472

Figure 13.2

Cumulative plot (improving trend)................................................................. 472

Figure 13.3

Interarrival times versus system age (improving trend)............................. 473

Figure 13.4

Dot plot of repair pattern.................................................................................. 473

Figure 13.5

Cumulative plot (degrading trend)................................................................. 473

Figure 13.6

Interarrival times versus system age (degrading trend).............................. 474

Figure 13.7

Power law model rectification......................................................................... 475

Figure 13.8

Exponential model rectification...................................................................... 476

Figure 13.9

Exponential model fit........................................................................................ 476

Figure 13.10  Average repair rates versus time (renewal data)........................................... 479
Figure 13.11  Average repair rates versus time (improving)............................................... 479
Figure 13.12  Average repair rates versus time (degrading)............................................... 479
Figure 13.13  Cumulative plot of repair data......................................................................... 485
Figure 13.14  Interarrival times versus system age............................................................... 486
Figure 13.15  Cumulative plot of MLE model fit to system data........................................ 495
Figure 13.16  Spreadsheet setup for applying SOLVER routine (MLE parameters)......... 495
Figure 13.17  Spreadsheet showing SOLVER results............................................................ 495
Figure 13.18  Cumulative plot of MLE model fit to system data........................................ 496
Figure 13.19  Cumulative plot of HPP and NHPP models fit to system data................... 499
Figure 13.20  Duane plot of cumulative MTBF versus cumulative time with least squares line....................................................................................................... 503
Figure 13.21  Duane plot with modified MLE lines.............................................................. 509
Figure 13.22  Duane plot of software cumulative MTBF estimates.................................... 511
Figure 13.23  Excel trendline dialog box................................................................................. 512
Figure 14.1

Bayesian gamma prior and posterior from Example 14.1............................ 522


Figure 14.2

Calling up Goal Seek........................................................................................ 525

Figure 14.3

Using Goal Seek to find the gamma prior a parameter a............................ 525

Figure 14.4

Calculating the gamma prior b parameter b.................................................. 526

Figure 14.5

Bayesian beta prior and posterior from Example 14.6.................................. 531

Figure 14.6

Prior and posterior densities from Example 14.7.......................................... 535


List of Tables

Table 1.1

Sample Data on 100 Fuses....................................................................................3

Table 1.2

Frequency Table of Fuse Data.............................................................................. 3

Table 1.3

Cumulative Frequency Function for Fuse Data................................................5

Table 1.4

Possible Outcomes for Drives............................................................................ 13

Table 1.5

Properties of Distributions Used in Reliability Studies................................. 19

Table 3.1

Equivalent Failure Rates in Different Units..................................................... 48

Table 3.2

Sample Data of Equivalent Month of Memory Chip Failure........................ 50

Table 3.3

Frequency Table of Memory Chip Data........................................................... 50

Table 3.4

Chi-Square Goodness-of-Fit Worksheet for the Memory Chip Data.............................................................................................. 61

Table 3.5

Spreadsheet Functions for k-Factors for Confidence Limits on the Exponential Failure Rate.............................................................................. 69

Table 3.6

Exponential Zero Failure Estimates.................................................................. 70

Table 3.7

Summary of Exponential Distribution Properties......................................... 78

Table 3.1A

Percentage Points for Modified Kolmogorov D*-Statistics for F(t) Known........................................................................................................... 83

Table 3.2A

Percentage Points for Modified Kolmogorov D*-Statistics (Mean Unknown)................................................................................................. 84

Table 4.1

Solution to Example 4.1....................................................................................... 89

Table 4.2

Weibull Distribution Properties........................................................................ 92

Table 4.3

Weibull Formulas Summary.............................................................................. 94

Table 4.4

32 Field Failure Times from 101 Burned-In Components............................ 109

Table 4.1A

Adjustment Constants for L for Computing Likelihood Profile Intervals.................................................................................................. 117

Table 5.1

Standard Normal CDF Values......................................................................... 126

Table 5.2

Example 5.3 Worksheet..................................................................................... 133

Table 5.3

Results of Simulation Example (1000 Iterations per Cpk).............................. 135

Table 5.4

Lognormal Formulas and Properties............................................................. 139

Table 5.5

Life Test Failure Data (20 Units on Test)......................................................... 150

Table 6.1

LINEST Output.................................................................................................. 157

Table 6.2

Failure Times of 20 Components under Normal Operating Conditions (Time in Hours)............................................................................. 164


Table 6.3

Probability Plot Values, Exponential Distribution, Exact Times (n = 20)...................................................................................................... 165

Table 6.4

90% Confidence Interval Estimates, Exponential Distribution, Exact Failure Times (n = 20).............................................................................. 169

Table 6.5

Probability Plotting Values, Exponential Distribution, Readout Data (n = 100)....................................................................................... 171

Table 6.6

Readout Data (n = 100), 90% Pointwise Confidence Limits......................... 173

Table 6.7

Weibull Example, Exact Times (n = 20)........................................................... 176

Table 6.8

Lognormal Example, Exact Times (n = 600)................................................... 183

Table 6.9

Percentage Points for Modified Kolmogorov D*-Statistics.......................... 185

Table 6.10

Failure Times with EDF and 90% Confidence Band Limits........................ 186

Table 7.1

Product Limit Estimated Survival Probabilities........................................... 195

Table 7.2

Variance and Standard Error Estimates......................................................... 196

Table 7.3

Two-Sided 95% Confidence Limits.................................................................. 198

Table 7.4

JMP Data Table for Exact Data Example........................................................ 199

Table 7.5

Minitab Worksheet for Exact Data Example.................................................. 201

Table 7.6

Summary of Readout (Interval) Data............................................................. 204

Table 7.7

Joint Risk and Product Limit Estimates for Readout (Interval) Data with Losses Occurring Randomly......................................................... 205

Table 7.8

JMP Data Tables for Readout Example, Censoring Occurring at Beginning and at End of Interval................................................................ 207

Table 7.9

Table of CDF Estimates for Readout Example, Random Censoring within Intervals.............................................................................. 208

Table 7.10

Survival Data from Six-Week Reliability Study............................................ 210

Table 7.11

Life Table (Actuarial) Estimation of Failure Probabilities........................... 210

Table 7.12

Table of Stress Results for 20 Units................................................................. 214

Table 7.13

Partial Table of Ordered Ages of Entry, Failure, or Censored to Determine Number at Risk.............................................................................. 215

Table 7.14

Table of Observed Times to Failure................................................................ 217

Table 7.15

Analysis of Left-Censored Data...................................................................... 218

Table 7.16

Analysis of Left-Censored Data...................................................................... 218

Table 7.17

Disk Drive Data.................................................................................................. 221

Table 7.18

Percentiles of Distribution of Kolmogorov dN,1−α Statistics.......................... 225

Table 7.19

Critical Values of dN,1−α for H-W Confidence Bands When KN(tmax) < 0.75................................................................................................... 225


Table 7.20

Failure and Censor Times for Primary Mechanism (N = 50)...................... 226

Table 7.21

Hall–Wellner 90% Confidence Bands—Untransformed and with Log and Logit Transformations................................................................ 228

Table 7.22

Cumulative Hazard Calculation..................................................................... 232

Table 7.23

Cumulative Hazard Calculation for Exact Failure Times Example........... 233

Table 7.24

Multicensored Results.......................................................................................234

Table 7.25

Possible Outcomes for n = 5, FSFSF, Assuming Possible Eventual Suspension Failures........................................................................... 234

Table 7.26

Mean Order Numbers and Median Ranks....................................................234

Table 7.27

Mean Order Numbers Using Johnson Formula............................................ 235

Table 7.1A

Original Data, CDF Estimate, Standard Error, and Hall–Wellner Terms................................................................................... 236

Table 7.2A

One Bootstrap Run of Data in Table 7.1A....................................................... 236

Table 8.1

General Linear Acceleration Relationships................................................... 242

Table 8.2

Weibull Temperature–Stress Failure Data..................................................... 246

Table 8.3

Weibull Least Square Parameter Estimates................................................... 247

Table 8.4

Experimental Design Matrix........................................................................... 261

Table 8.5

Lognormal Stress-Failure Data........................................................................ 261

Table 8.6

Lognormal Stress Cell Parameter Estimates................................................. 263

Table 8.7

Summary of Arrhenius–Weibull Data Analysis........................................... 273

Table 8.1A

Spreadsheet Templates...................................................................................... 297

Table 9.1

Arrhenius Step Stress Example....................................................................... 303

Table 9.2

Degradation Data............................................................................................... 311

Table 9.3

Summary of Shipment and Failure Data....................................................... 314

Table 9.4

Negative Log-Likelihood Values for Different Models................................ 332

Table 9.5

Step Stress Data for Problem 9.1......................................................................343

Table 11.1

Binomial Cumulative Distribution Function: n = 4, p = .5........................... 382

Table 11.2

Cumulative Probability.....................................................................................384

Table 11.3

Contingency Table 1 for Fisher’s Exact Test................................................... 388

Table 11.4

Contingency Table 2 for Fisher’s Exact Test................................................... 388

Table 11.5

JMP Data Table................................................................................................... 389

Table 11.6

Fisher’s Exact Test Results................................................................................ 389

Table 11.7

Matrix of Possible Choices................................................................................ 394

Table 11.8

Binomial Probability Calculations for Sample of Size n = 50 and p = 0.02.............................................................................................. 396


Table 11.9

Probability of Three or Less Failures in Sample of Size n = 50 for Various Lot Percent Defective Values....................................................... 396

Table 11.10

LTPD Sampling Plans....................................................................................... 403

Table 11.11

LTPD Evaluation................................................................................................405

Table 11.12

Spreadsheet for Nearly Minimum Sampling Plans..................................... 407

Table 11.13

Minimum Sample Sizes for Zero Rejects at Various Probabilities............. 413

Table 12.1

Repair Age Histories (Hours)..........................................................................425

Table 12.2

Repair Histories for Four Machines................................................................ 429

Table 12.3

Repair Histories for Services at Two Different Data Centers...................... 436

Table 12.4

Repair Histories for Two Locations................................................................ 439

Table 12.5

One-Sided Lower Confidence Bound Factors for the MTBF (Failure-Censored Data)................................................................................... 447

Table 12.6

One-Sided Lower Confidence Bound Factors for the MTBF (Time-Censored Data).......................................................................................448

Table 12.7

One-Sided Upper Confidence Bound Factors for the MTBF (Failure- or Time-Censored Data).................................................................... 448

Table 12.8

Test Length Guide............................................................................................. 449

Table 12.9

Failure Times in Hours..................................................................................... 451

Table 12.10

Different Availability Levels............................................................................454

Table 12.1A  Repair Histories for Five Systems.................................................................... 463
Table 13.1

Probability of R Reversals by Chance for n = 4.............................................484

Table 13.2

Critical Values of Rn, % of the Number of Reversals for the Reverse Arrangement Test............................................................................... 484

Table 13.3

Steps for Fisher’s Composite Test.................................................................... 487

Table 13.4

Critical Values for Goodness-of-Fit Test......................................................... 492

Table 13.5

Repair History in Hours (Simulated Data: a = 0.25 and b = 0.50)................ 492

Table 13.6

Transformed Repair Times............................................................................... 494

Table 13.7

R1 and R2 Values to Multiply MTBF Estimate and Obtain Confidence Bounds (Test Ends at nth Fail).................................................... 506

Table 13.8

P1 and P2 Values to Multiply MTBF Estimate and Obtain Confidence Bounds (Test Ends at Time T)..................................................... 507

Table 13.9

Results of Software Evaluation Testing.......................................................... 510

Table 14.1

Bayesian Paradigm: Advantages and Disadvantages.................................. 523

Table 14.2

Beta Distribution Parameters........................................................................... 531

List of Examples

Example 1.1

Automobile Fuse Data......................................................................................1

Example 1.2

Conditional Probabilities............................................................................... 12

Example 1.3

Total Probabilities........................................................................................... 14

Example 1.4

Bayes’ Rule....................................................................................................... 14

Example 1.5

Bayes’ Rule Applied to Misclassified Items................................................ 15

Example 1.6

Probability Expression for CDF.................................................................... 16

Example 1.7

The Uniform Distribution............................................................................. 20

Example 1.8

The Beta Distribution..................................................................................... 21

Example 1.9

Data Simulation............................................................................................... 24

Example 1.10

Data Simulation............................................................................................... 25

Example 2.1

Life Distribution Calculations...................................................................... 30

Example 2.2

System Reliability........................................................................................... 32

Example 2.3

Failure Rate Calculations............................................................................... 35

Example 2.4

Estimating the CDF, Reliability Function, and AFR.................................. 37

Example 2.5

Residual MTTF(T0) Calculation..................................................................... 39

Example 2.6

Multicensored Experimental Data...............................................................44

Example 2.7

Multicensored Field Failure Data.................................................................44

Example 2.8

Left-Truncated Data........................................................................................44

Example 2.9

Left- and Right-Censored Data.....................................................................44

Example 3.1

Exponential Probabilities............................................................................... 48

Example 3.2

Constant Failure Rate..................................................................................... 49

Example 3.3

Exponential Data............................................................................................. 49

Example 3.4

Mean Time to Fail........................................................................................... 52

Example 3.5

Piecewise Exponential Approximation.......................................................54

Example 3.6

Failure Rate and MTTF.................................................................................. 57

Example 3.7

Chi-Square Goodness of Fit.......................................................................... 61

Example 3.8

Goodness-of-Fit Tests Based on EDF Statistics...........................................64

Example 3.9

Confidence Bounds for λ............................................................................... 69

Example 3.10

Zero Failures Estimation............................................................................... 70


Example 3.11

Confidence Bounds on MTTF....................................................................... 70

Example 3.12

Choosing Sample Sizes.................................................................................. 72

Example 3.13

Choosing the Test Times................................................................................ 73

Example 3.14

Choosing Pass/Fail Criteria.......................................................................... 73

Example 3.15

Minimum Sample Sizes................................................................................. 74

Example 3.16

Minimum Test Times..................................................................................... 74

Example 3.17

Simulating Exponential Data........................................................................ 76

Example 3.18

Fitting a Two-Parameter Exponential Model to Data................................ 76

Example 3.1A

Determining the Sample Size Using Goal Seek (Example 3.12 revisited)................................................................................. 79

Example 3.2A

Choosing the Test Times................................................................................80

Example 3.3A

Choosing Pass/Fail Criteria.......................................................................... 81

Example 3.4A

KS Test.............................................................................................................. 82

Example 4.1

Weibull Properties.......................................................................................... 89

Example 4.2

Weibull Closure Property.............................................................................. 93

Example 4.3

Rayleigh Radial Error..................................................................................... 97

Example 4.4

MLE for the Exponential................................................................................ 99

Example 4.5

Weibull MLE Parameter Estimation.......................................................... 102

Example 4.6

Weibull MLE Parameter Estimation: Left-Truncated Data..................... 109

Example 5.1

Normal Distribution Calculations............................................................. 125

Example 5.2

Root–Mean–Square Example...................................................................... 128

Example 5.3

Censored Normal Data................................................................................ 132

Example 5.4

Simulation of Cpk Distribution.................................................................... 134

Example 5.5

Lognormal Properties.................................................................................. 137

Example 5.6

Lognormal MLEs and Likelihood Profile Confidence Limits: Exact Times of Failure..................................................................... 142

Example 5.7

Lognormal MLEs and Likelihood Profile Confidence Limits: Interval Data..................................................................................... 144

Example 5.8

Lognormal Calculations.............................................................................. 147

Example 6.1

Linear Equations........................................................................................... 154

Example 6.2

Regression Line............................................................................................. 156

Example 6.3

Linear Rectification....................................................................................... 160

Example 6.4

Probability Plots for Exponential Distribution......................................... 164

Example 6.5

Weibull Probability Plotting: Exact Times................................................ 176


Example 6.6

Lognormal Probability Plot......................................................................... 182

Example 6.7

EDF and Simultaneous Confidence Bounds Calculation........................ 185

Example 6.1A

Order Statistics for Exponential Distribution........................................... 189

Example 6.2A

Confidence Limits on Order Statistics for Exponential Distribution.................................................................................................... 190

Example 7.1

Kaplan–Meier Product Limit Estimates for Exact Failure Time Data....................................................................................................... 194

Example 7.2

Actuarial Life Table Estimation.................................................................. 209

Example 7.3

Left-Truncated Data...................................................................................... 214

Example 7.4

Left-Censored Data....................................................................................... 217

Example 7.5

Current Status Data...................................................................................... 221

Example 7.6

Estimating F(t) with H-W Confidence Bounds......................................... 226

Example 7.7

Cumulative Hazard Plotting....................................................................... 232

Example 7.1A

Bootstrap Confidence Interval Calculation............................................... 235

Example 8.1

Acceleration Factors for Exponential Distribution.................................. 244

Example 8.2

Weibull Analysis of High-Stress Failure Data.......................................... 245

Example 8.3

Weibull Likelihood Equal-Shapes Test...................................................... 257

Example 8.4

Confidence-Bound Calculation for a Common Weibull Slope.............. 259

Example 8.5

Lognormal Stress-Failure Data................................................................... 260

Example 8.6

Calculation of Acceleration Factor Given ΔH........................................... 268

Example 8.7

Estimating ΔH from Two Temperature Stress Cells of Data.................. 268

Example 8.8

Arrhenius Model Analysis Using Both Regression and MLE Methods........................................................................................ 269

Example 8.9

MLE Analysis of the Six-Stress Cells Given in Example 8.5................... 274

Example 8.10

Calculating Needed Burn-In Time............................................................. 282

Example 8.11

Life Test Experimental Design....................................................................284

Example 8.1A

Weibull Likelihood Equal-Shapes Test...................................................... 287

Example 8.2A

Confidence Bound Calculation for a Common Weibull Slope............... 288

Example 9.1

An Arrhenius Step Stress Experiment....................................................... 302

Example 9.2

An Arrhenius, Power Law Step Stress Experiment................................. 306

Example 9.3

Degradation Data Analysis......................................................................... 311

Example 9.4

Lifetime Regression Used to Estimate the Reliability Effects of Vintage and Plant of Manufacture and Their Significance......................................................................................... 314


Example 9.5

Defect Model................................................................................................. 322

Example 9.6

Maximum Likelihood Estimation for the Defect Model........................ 324

Example 9.7

Multistress Defect Model Example............................................................ 329

Example 9.8

Defect Model Data Treated as Truncated Data......................................... 333

Example 9.1A

JMP’s Nonlinear Modeling Platform......................................................... 335

Example 10.1

Series Systems...............................................................................................346

Example 10.2

Bottom-Up Calculations.............................................................................. 347

Example 10.3

Redundancy Improvement..........................................................................348

Example 10.4

Maximizing Reliability Using Redundancy............................................. 349

Example 10.5

Standby Model.............................................................................................. 351

Example 10.6

Expected Lifetime of k-Out-of-n System of Independent Exponentially Distributed Components................................................... 353

Example 10.7

Complex System Reduction (Five Components)......................................354

Example 10.8

Complex System Reduction (Six Components)........................................354

Example 10.9

Minimal Path Analysis................................................................................ 357

Example 10.10

Minimal Cut Set Analysis........................................................................... 358

Example 10.11

Minimal Path Analysis When “k-Out-of-n” Blocks Are Present........................................................................................ 359

Example 10.12

General Reliability Algorithm.................................................................... 361

Example 10.13

Burn-In Model............................................................................................... 363

Example 10.14

Black Box Testing I........................................................................................ 366

Example 10.15

Black Box Testing II...................................................................................... 366

Example 11.1

Binomial Calculations.................................................................................. 373

Example 11.2

Binomial pmf................................................................................................. 374

Example 11.3

Shortcomings of the Normal Approximation........................................... 379

Example 11.4

Score Confidence Intervals.......................................................................... 380

Example 11.5

Simulation of System Reliability................................................................. 383

Example 11.6

Geometric Distribution................................................................................384

Example 11.7

Negative Binomial Distribution.................................................................. 385

Example 11.8

Hypergeometric Distribution..................................................................... 386

Example 11.9

Poisson Distribution..................................................................................... 391

Example 11.10

Confidence Limits for Expected Value of a Poisson Distribution..................................................................................... 393


Example 11.11

Poisson Confidence Limits.......................................................................... 393

Example 11.12

Sampling Plan for Accelerated Stress, Weibull Distribution..................... 409

Example 11.13

Cumulative Count Control Charts for Low PPM..................................... 411

Example 12.1

The Mean Cumulative Function................................................................. 427

Example 12.2

Naive Confidence Limits for the MCF.......................................................430

Example 12.3

Correct Approximate Confidence Limits for the MCF............................430

Example 12.4

Comparison of MCFs for Servers at Two Different Datacenters..................................................................................................... 436

Example 12.5

HPP Probability Estimates.......................................................................... 441

Example 12.6

HPP Estimates in Terms of the MTBF.......................................................442

Example 12.7

Time to kth Repair for HPP Process...........................................................443

Example 12.8

Spare Parts for an HPP.................................................................................444

Example 12.9

Memoryless Property of the Poisson Process...........................................445

Example 12.10

Confidence Bounds on the Population MTBF for an HPP......................... 449

Example 12.11

Test Length Guide for an HPP.................................................................... 449

Example 12.12

Likelihood Ratio Test for Comparison of Two Exponential MTTFs (Nonrepairable Components)................................. 451

Example 12.13

Likelihood Ratio Test for Comparison of Two HPP MTBFs (Repairable Systems)....................................................................... 452

Example 12.14

Simulation of 10 Time-Censored HPPs..................................................... 456

Example 12.15

Renewal Data Calculation of CDF.............................................................. 461

Example 12.1A  The Cox F-Test............................................................................................... 466
Example 12.2A  Renewal Data Calculation of CDF.............................................................. 468
Example 13.1

Laplace Test for Trend versus a Poisson Process...................................... 481

Example 13.2

Reverse Arrangement Test........................................................................... 485

Example 13.3

Fisher’s Composite Test................................................................................ 487

Example 13.4

Nonhomogeneous Poisson Process............................................................ 488

Example 13.5

NHPP with Power Relation Intensity........................................................ 491

Example 13.6

NHPP with Exponential Intensity Model................................................. 498

Example 13.7

Duane Reliability Growth Estimation....................................................... 502

Example 13.8

Confidence Bounds and Modified MLEs.................................................. 505

Example 13.9

Power Relationship Model Reliability Growth........................................508

Example 13.10

Software Reliability Improvement............................................................. 510

Example 13.11

Simulating an NHPP with Power Relation Intensity.............................. 513


Example 13.12

Simulating the First Six Repair Times for NHPP with Specified Power Relation Model................................................................. 514

Example 14.1

Lower MTBF Bounds Using a Bayesian Gamma Prior........................... 521

Example 14.2

Calculating Prior Parameters Using the 50/95 Method.......................... 525

Example 14.3

Calculating a Bayesian Test Time............................................................... 526

Example 14.4

A Minimum Bayesian Testing Time Calculation..................................... 527

Example 14.5

Using Engineering Judgment to Arrive at Bayesian Prior Parameters........................................................................................... 528

Example 14.6

MTBF Estimate after Test Is Run................................................................ 529

Example 14.7

Bayesian Estimation and Credibility Intervals for p............................... 530

Example 14.8

Bayesian Estimation and Credibility Intervals for the Lognormal T50. .............................................................................................. 532

Example 14.9

Using an Improper Noninformative Prior for Exponential Fail Times.......................................................................................................534

1
Basic Descriptive Statistics

One of the most useful skills that a reliability specialist can develop is the ability to convert a mass (mess?) of data into a form suitable for meaningful analysis. Raw numbers by themselves are not useful; what is needed is a distillation of the data into information. In this chapter, we discuss several important concepts and techniques from the field of descriptive statistics. These methods are used to extract a relevant summary from collected data. The goal is to describe and understand the random variability that exists in all measurements of real-world phenomena and experimental data. These concepts and techniques are basic and are applied to reliability data throughout the book. The topics we cover include populations and samples; frequency functions, histograms, and cumulative frequency functions; the population cumulative distribution function (CDF) and probability density function (PDF); elementary probability concepts, random variables, population parameters, and sample estimates; theoretical population shape models; and data simulation.

1.1  Populations and Samples

Statistics is concerned with variability, and it is a fact of nature that variation exists. No matter how carefully a process is run, an experiment is executed, or a measurement is taken, there will be differences in repeatability due to the inability of any individual or system to completely control all possible influences. If the variability is excessive, the study or process is described as lacking control. If, on the other hand, the variability appears reasonable, we accept it and continue to operate. How do we visualize variability in order to understand if we have a controlled situation? Consider the following example:

EXAMPLE 1.1  AUTOMOBILE FUSE DATA

A manufacturer of automobile fuses produces lots containing 100,000 fuses rated at 5A. Thus, the fuses are supposed to open in a circuit if the current through the fuse exceeds 5A. Since a fuse protects other elements from possibly damaging electrical overload, it is very important that fuses function properly. How can the manufacturer be assured that the fuses do indeed operate correctly and that there is no excessive variability? Obviously, he cannot test all fuses to the rated limit since that act would destroy the product he wishes to sell. However, he can sample a small quantity of fuses (say, 100 or 200) and test them to destruction to measure the opening point of each fuse. From the sample data, he could then infer what the behavior of the entire group would be if all fuses were tested.

In statistical terms, the entire set or collection of measurements of interest (e.g., the blowing values of all fuses) defines a population. A population is the entire set or collection of measurements of interest.


Note that a population may be finite, as in the case of a fuse lot, or it may be infinite, as occurs in a manufacturing process where the population could be all product of a specific type that has been or could ever be produced in a fabricating area. The sample (e.g., the 100 or 200 fuses tested to destruction) is a subset of data taken from the population. A sample is a subset of data from the population. The objective in taking a sample is to make inferences about the population.

Note that reliability data commonly exists in one of two forms. In variables data, the actual measurement of interest is continuous, such as time in minutes, length in inches, or temperature in degrees Celsius. In attributes data, the measurements are quantified into discrete categories such as pass or fail, go or no go, in spec or out of spec, small or medium or large, and so on. Attributes data includes counts, proportions, and percentages. Although both types of data are discussed in this text, applications and analysis of attributes data are treated extensively in Chapter 11. In the fuse data example, we record variables data, but we could also transform the same results into attributes data by stating whether a fuse opened before or after the 5A rating. Similarly, in reliability work one can measure the actual failure time of an item (variables data) or record the number of items failing before a fixed time (attributes data). Both types of data occur frequently in reliability studies.

In Chapter 3, Sections 3.10–3.12, we will discuss such topics as choosing a sample size, drawing a sample randomly, and the "confidence" in the data from a sample. For now, however, let's assume that the sample has been properly drawn and consider what to do with the data in order to present an informative picture.

1.2  Histograms and Frequency Functions

In stating that a sample has been randomly drawn, we imply that each measurement or data point in the population has an equal chance or probability of being selected for the sample. If this requirement is not fulfilled, the sample may be "biased" and correct inferences about the population might not be possible.

What information does the manufacturer expect to obtain from the sample measurements of 100 fuses? First, the data should cluster about the rated value of 5A. Second, the spread in the data (variability) should not be large, because the manufacturer realizes that serious problems could result for the users if some fuses blow at too high a value. Similarly, fuses opening at too low a level could cause needless repairs or generate unnecessary concerns. The reliability analyst randomly samples 100 fuses and records the data shown in Table 1.1. It is easy to determine the high and low values from the sample data and see that the measurements cluster roughly about the number 5. Yet, there is still difficulty in grasping the full information contained in this set of data. Let's apply the following procedure:

1. Find the range of the data by subtracting the lowest from the highest value. For this set, the range is 5.46 − 4.43 = 1.03.
2. Divide the range into ten or so equally spaced intervals such that readings can be uniquely classified into each cell. Here, the cell width is 1.03/10 ≈ 0.10, and we choose the starting point to be 4.395, a convenient value below the minimum of the data and carried out one digit more precise than the data to avoid any confusion in assigning readings to individual cells. Note that the terms "bin," "class," or "cell" are used interchangeably in the literature and also by statistical software programs to denote one of these equally spaced intervals.
3. Increment the starting point by multiples of the cell width until the maximum value is exceeded. Thus, since the maximum value is 5.46, we generate the numbers 4.395, 4.495, 4.595, 4.695, 4.795, 4.895, 4.995, 5.095, 5.195, 5.295, 5.395, and 5.495. These values represent the endpoints or boundaries of each cell, effectively dividing the range of the data into equally spaced class intervals covering all the data points.
4. Construct a frequency table as shown in Table 1.2, which gives the number of times a measurement falls inside a class interval.
5. Make a graphical representation of the data by sketching vertical bars centered at the midpoints of the class cells with bar heights proportionate to the number of values falling in that class. This graphical representation, shown in Figure 1.1, is called a histogram.

A histogram is a graphical representation in bar chart form of a frequency table or frequency distribution. The vertical axis in a histogram may represent the actual count in a cell, or it may state the percentage of observations of the total sample that occur in a cell. Also, the range here is divided by the number 10 to generate a cell width, but any convenient number (usually between 8 and 20) may be used. Too small a number may not reveal the shape of the data and too large a number can result in many empty cells and a flat-appearing distribution. Sometimes, a few tries are required to arrive at a suitable choice.

TABLE 1.1
Sample Data on 100 Fuses
4.64  4.98  4.73  5.21  5.06  5.04  5.03  5.46  5.01  4.77
4.95  4.93  5.37  5.11  4.69  5.04  5.05  5.12  4.94  4.58
5.25  4.72  4.81  5.15  5.14  4.78  4.94  5.12  5.02  5.18
5.21  5.07  5.19  5.28  4.83  4.96  5.02  4.85  5.16  5.03
4.90  4.80  4.77  5.20  4.78  4.94  4.43  5.05  4.88  5.10
4.67  4.98  4.79  4.73  4.72  5.24  4.91  5.26  5.10  4.67
4.97  4.66  5.08  5.32  5.21  5.22  4.84  5.01  4.80  5.21
4.92  4.43  5.07  4.79  5.02  5.00  4.75  4.64  5.10  4.73
4.87  4.78  4.65  5.10  4.89  4.60  4.88  4.86  5.20  4.88
5.11  4.53  5.39  4.94  5.19  4.88  4.79  4.73  5.11  4.80

TABLE 1.2
Frequency Table of Fuse Data

Cell Boundaries    Number in Cell
4.395–4.495          2
4.495–4.595          2
4.595–4.695          8
4.695–4.795         15
4.795–4.895         14
4.895–4.995         13
4.995–5.095         16
5.095–5.195         15
5.195–5.295         11
5.295–5.395          3
5.395–5.495          1
Total count        100

FIGURE 1.1 Histogram of fuse breakdown measurements.

There is a useful Excel spreadsheet function called FREQUENCY that will generate a frequency table such as that shown in Table 1.2. Say the 100 fuse breakpoints are entered into column A, cells A1–A100, and the 12 cell boundaries, starting with 4.395 and ending with 5.495, are entered into column B, cells B1–B12. Next, we highlight (click and drag) into an empty column, say C, 13 blank rows in cells C1–C13. Then we type in the function = FREQUENCY(A1:A100,B1:B12). The expression is evaluated as a matrix operation by pressing the keys Ctrl+Shift+Enter together instead of pressing just Enter alone. This action produces the Table 1.2 frequencies in rows C2–C12 (C1 contains counts up to 4.395 and C13 contains counts after 5.495). The FREQUENCY function in OpenOffice software works the same way, except that a semicolon is used between arguments instead of a comma.

EXERCISE 1.1
Use Excel or OpenOffice to generate the frequencies given in Table 1.2, using the Table 1.1 sample data and the same interval endpoints as used in Table 1.2.
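For readers who prefer a scripting language to a spreadsheet, the same tally is easy to automate. The following Python sketch is only an illustration (it is not part of the book's spreadsheet/JMP/Minitab toolset); it bins the Table 1.1 fuse readings into the Table 1.2 cells.

```python
# Illustrative sketch (not from the book): tally the Table 1.1 fuse data
# into the Table 1.2 cells using plain Python.
fuse_data = [
    4.64, 4.98, 4.73, 5.21, 5.06, 5.04, 5.03, 5.46, 5.01, 4.77,
    4.95, 4.93, 5.37, 5.11, 4.69, 5.04, 5.05, 5.12, 4.94, 4.58,
    5.25, 4.72, 4.81, 5.15, 5.14, 4.78, 4.94, 5.12, 5.02, 5.18,
    5.21, 5.07, 5.19, 5.28, 4.83, 4.96, 5.02, 4.85, 5.16, 5.03,
    4.90, 4.80, 4.77, 5.20, 4.78, 4.94, 4.43, 5.05, 4.88, 5.10,
    4.67, 4.98, 4.79, 4.73, 4.72, 5.24, 4.91, 5.26, 5.10, 4.67,
    4.97, 4.66, 5.08, 5.32, 5.21, 5.22, 4.84, 5.01, 4.80, 5.21,
    4.92, 4.43, 5.07, 4.79, 5.02, 5.00, 4.75, 4.64, 5.10, 4.73,
    4.87, 4.78, 4.65, 5.10, 4.89, 4.60, 4.88, 4.86, 5.20, 4.88,
    5.11, 4.53, 5.39, 4.94, 5.19, 4.88, 4.79, 4.73, 5.11, 4.80,
]

# Cell width and boundaries, starting one digit below the data minimum
width = 0.1
boundaries = [round(4.395 + width * i, 3) for i in range(12)]   # 4.395 ... 5.495

# Frequency table: count readings falling in each (lower, upper] cell
counts = [
    sum(1 for value in fuse_data if low < value <= up)
    for low, up in zip(boundaries[:-1], boundaries[1:])
]
for (low, up), count in zip(zip(boundaries[:-1], boundaries[1:]), counts):
    print(f"{low:.3f}-{up:.3f}: {count}")
print("Total:", sum(counts))      # the printed counts match Table 1.2
```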

In summary, the histogram provides us with a picture of the data from which we can intuitively see the center of the distribution, the spread, and the shape. The shape is important because we usually have an underlying idea or model as to how the entire population should look. The sample shape either confirms this expectation or gives us reason to question our assumptions. In particular, a shape that is symmetric about the center, with most of the observations in the central region, might reflect data from certain symmetric distributions such as the normal or Gaussian distribution. Alternatively, a nonsymmetric appearance would imply the existence of data points spaced farther from the center in one direction than in the other, which could lead to the consideration of a skewed distribution, such as a Weibull or lognormal. For the data presented in Example 1.1, we note that the distribution appears reasonably symmetric. Hence, based on the histogram and the way the ends of the distribution taper off, the manufacturer believes that values deviating from the central target by much more than about 10% are not likely to occur. This variability is accepted as reasonable.


1.3  Cumulative Frequency Function

There is another way of representing the data that can be very useful. By reference to Table 1.2, let us accumulate the number of observations less than or equal to each upper cell boundary as shown in Table 1.3. This representation of the data is called a cumulative frequency function. The graphical rendering of the cumulative frequency function is shown in Figure 1.2. Note that the cumulative frequency distribution is never decreasing: it starts at zero and reaches the total sample size. It is often convenient to represent the cumulative count in terms of a fraction or percentage of the total sample size used. In that case, the cumulative frequency function ranges from 0 to 1.00 in fractional representation or 0% to 100% in percentage notation. In this text, we often employ the percentage form.

TABLE 1.3
Cumulative Frequency Function for Fuse Data

Upper Cell Boundary (UCB)    Number of Observations Less than or Equal to UCB
4.495                          2
4.595                          4
4.695                         12
4.795                         27
4.895                         41
4.995                         54
5.095                         70
5.195                         85
5.295                         96
5.395                         99
5.495                        100

FIGURE 1.2 Plot of cumulative frequency function.

Table 1.3 and Figure 1.2 show that the cumulative frequency curve is obtained by summing the frequency function count values. This summation process will be generalized by integration when we discuss the population concepts underlying the frequency function and the cumulative frequency function in Section 1.4.
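The accumulation step is just a running sum of the cell counts. A minimal Python sketch (not from the book) applied to the Table 1.2 counts reproduces the Table 1.3 values and their percentage form.

```python
# Illustrative sketch (not from the book): build the Table 1.3 cumulative
# frequency function by accumulating the Table 1.2 cell counts.
cell_counts = [2, 2, 8, 15, 14, 13, 16, 15, 11, 3, 1]
upper_boundaries = [round(4.495 + 0.1 * i, 3) for i in range(11)]

cumulative = []
running_total = 0
for count in cell_counts:
    running_total += count
    cumulative.append(running_total)

# Print count and percentage of the sample at or below each upper cell boundary
n = cumulative[-1]
for ucb, cum in zip(upper_boundaries, cumulative):
    print(f"<= {ucb:.3f}: {cum:3d}  ({100 * cum / n:.0f}%)")
```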

1.4  The Cumulative Distribution Function and the Probability Density Function

The frequency distribution and the cumulative frequency distribution are calculated from sample measurements. Since the samples are drawn from a population, what can we state about this population? The typical procedure is to assume a mathematical formula that provides a theoretical model for describing the way the population values are distributed. The sample histograms and the cumulative frequency functions are the estimates of these population models. The model corresponding to the frequency distribution is the PDF, denoted by f(x), where x is any value of interest. The PDF may be interpreted in the following way: f(x)dx is the fraction of the population values occurring in the interval dx. In reliability work, we often have the failure time t as the variable of interest. Therefore, f(t)dt is the fraction of failure times of the population occurring in the interval dt. A very simple example for f(t) is the exponential distribution, given by the equation

f(t) = λe^(−λt),  0 ≤ t < ∞

where λ is a constant. The plot of f(t) is shown in Figure 1.3. The exponential distribution is a widely applied model in reliability studies and forms the basis of Chapter 3.

FIGURE 1.3 Plot of PDF for the exponential distribution.

The cumulative frequency distribution similarly corresponds to a population model called the CDF and is denoted by F(x). The CDF is related to the PDF via the relationship

F(x) = ∫_{−∞}^{x} f(y) dy

where y is the dummy variable of integration. F(x) may be interpreted as the fraction of values less than or equal to x in the population. Alternatively, F(x) gives the probability of a value less than or equal to x occurring in a single random draw from the population described by F(x). Since in reliability work we usually deal with failure times, t, which are nonnegative, the CDF for population failure times is related to the PDF by

F(t) = ∫_{0}^{t} f(y) dy

For the exponential distribution,

F(t) = ∫_{0}^{t} λe^(−λy) dy = [−e^(−λy)]_{0}^{t} = 1 − e^(−λt)
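The PDF-to-CDF relationship can also be checked numerically. The short Python sketch below is an illustration only (not from the book); it integrates the exponential PDF with a simple trapezoidal rule and compares the result with the closed-form CDF 1 − e^(−λt). The rate and time values are arbitrary choices.

```python
# Illustrative sketch (not from the book): verify numerically that the
# exponential CDF F(t) = 1 - exp(-lambda*t) is the integral of the PDF
# f(t) = lambda*exp(-lambda*t) from 0 to t.
import math

lam = 0.5      # assumed rate parameter
t = 3.0        # evaluate the CDF at this time

def pdf(x):
    return lam * math.exp(-lam * x)

# Simple trapezoidal integration of f from 0 to t
steps = 10_000
h = t / steps
integral = 0.5 * (pdf(0.0) + pdf(t)) + sum(pdf(i * h) for i in range(1, steps))
integral *= h

closed_form = 1.0 - math.exp(-lam * t)
print(integral, closed_form)    # both are approximately 0.7769
```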

The CDF for the exponential distribution is plotted in Figure 1.4.

FIGURE 1.4 CDF for exponential distribution.

When we calculated the cumulative frequency function in the fuse example, we worked with grouped data (i.e., data classified by cells). However, another estimate of the population CDF can be generated by ordering the individual measurements from smallest to largest, and then plotting the successive fractions 1/n, 2/n, 3/n, ..., n/n versus the ordered data points. Such a representation is called the empirical distribution function (EDF) and is shown in Figure 1.5 for the data from the fuse example. Note that the EDF steps up by 1/n at each data point and remains constant until the next point. The advantage of using the EDF instead of grouping the data is obviously that all data points are pictured; the disadvantage is that more computational effort is involved. However, spreadsheet software can easily perform the calculations and plotting. See Appendix 1A for a method to create step charts using spreadsheet plots.

FIGURE 1.5 EDF for fuse data.

Since F(x) is a probability, all the rules and formulas for manipulating probabilities can be used when working with CDFs. Some of these basic rules are described in Section 1.5.

EXERCISE 1.2
For the data in Table 1.1, construct a frequency table using 4.395 as the starting point and 0.2 as the interval width. Create a histogram of this frequency table. Compare it to Figure 1.1.

EXERCISE 1.3
Using the results from Exercise 1.2, construct a cumulative frequency table and create a plot of the cumulative frequency function. How does it compare to Figure 1.2?

EXERCISE 1.4
Take columns 2, 5, and 8 (left to right) from Table 1.1, for a total of 30 data points. Assume a random sample, arrange the points in order from smallest to largest, and plot the EDF. Compare to Figure 1.5.

EXERCISE 1.5
The histogram shown in Figure 1.1 was generated using JMP software. Use the JMP Analyze, Distribution platform and the data set shown in Table 1.1 (FuseData.jmp at the "Downloads & Updates" tab of the web page for this book at http://www.crcpress.com/product/isbn/9781584884668) to generate a histogram. The result may be somewhat different based on a JMP default choice for the number of bins. If so, use the "hand" cursor to change the number of bins until the same histogram as shown in Figure 1.1 is obtained. Next, use the histogram spreadsheet function of Excel (in Tools, Data Analysis) to obtain a histogram. The Table 1.1 data set for spreadsheet use is FuseData.xls. In order to get the exact same graph shown in Figure 1.1, input bin numbers starting with 4.495 and increasing in steps of 0.1 to 5.495. Finally, use Minitab to get the histogram shown in Figure 1.6 (one must change the default number of bins from 12 to 11).

NOTE: There is nothing "wrong" or "misleading" with obtaining dissimilar histograms from different software programs. It is up to the analyst to vary the bin numbers and locations to obtain one of many reasonable (although varying) views of the data. For example, the default output histogram from Minitab is shown in Figure 1.6.

FIGURE 1.6 Minitab histogram of fuse data.
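As a final note on the EDF introduced above, the ordered-fraction calculation itself takes only a few lines in a scripting language. The Python sketch below is illustrative only (not from the book), and the small data set shown is hypothetical.

```python
# Illustrative sketch (not from the book): the EDF assigns the fraction i/n
# to the i-th smallest observation of a sample.
data = [5.21, 4.43, 4.98, 5.05, 4.77]    # hypothetical readings

n = len(data)
ordered = sorted(data)
for i, value in enumerate(ordered, start=1):
    print(f"F_hat({value}) = {i / n:.2f}")
# Between successive ordered points the EDF stays constant,
# stepping up by 1/n at each observation.
```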

1.5  Probability Concepts

In the classical sense, the term probability can be thought of as the expected relative frequency of occurrence of a specific event in a very large collection of possible outcomes. For example, if we toss a balanced coin many times, we expect the number of occurrences of the event "heads" to comprise approximately half of the number of outcomes. Thus, we say the probability of heads on a single toss is 0.5, 50%, or 50–50. It is typical to express probabilities either as a fraction between 0 and 1 or as a percentage between 0% and 100%.

There are two very useful relations often invoked in probability theory. These rules relate to the occurrence of two or more events. In electrical engineering terms, we are defining "and" and "or" relations. The first rule states that if P(A) is the probability of event A occurring and P(B) is the probability of event B occurring, then the probability of events A and B occurring simultaneously, denoted P(AB), is

P(AB) = P(A)P(B|A)  or  P(AB) = P(B)P(A|B)

where P(A|B) designates the "conditional" probability of A given that event B has occurred. Let's explain conditional probability further. We imply by the terminology that one event may be affected by the occurrence of another event. For example, suppose we ask what the probability is of getting two black cards in a row in successive draws from a well-shuffled deck of cards, without replacing the first card drawn. Obviously, the probability of the first card being a black card (call this event A) is

P(A) = favorable outcomes/total outcomes = 26/52 = 1/2

The probability of the second card being a black card (event B) changes depending on whether or not the first card drawn is a black card. If yes, then the probability of the second card being a black card is

P(B|A) = 25/51

Therefore, the probability of two successive black cards is

P(AB) = P(A)P(B|A) = (1/2)(25/51) = 25/102

Two events, A and B, are said to be independent if the occurrence of one does not affect the probability of the other occurrence. The formal definition states that two events A and B are independent if and only if

P(AB) = P(A)P(B)

This expression is sometimes referred to as the multiplication rule for the probability of independent events occurring simultaneously. In general, the probability of independent events occurring is just the product of the individual probabilities of each event. For example, in the card situation, replacing the first card drawn and reshuffling the deck will make event B independent of event A. Thus, the probability of two successive black cards, with replacement and reshuffling between draws, is

P(AB) = P(A)P(B) = (26/52)(26/52) = 1/4

Similarly, the probability of simultaneously getting a 6 on one roll of a die and an ace in one draw from a deck of cards, apparently independent events, is

P(AB) = (1/6)(4/52) = 1/78

The extension of these conditional probability principles to three or more events is possible. For example, the rule for the joint probability of three events, A, B, and C, is

P(ABC) = P(A)P(B|A)P(C|AB)

For independent events, the formula becomes

P(ABC) = P(A)P(B)P(C)

The second important probability formula relates to the situation in which either of two events, A or B, may occur. The expression for this "union" is

P(A ∪ B) = P(A) + P(B) − P(AB)

If the events are independent, then the relation becomes

P(A ∪ B) = P(A) + P(B) − P(A)P(B)

The last term in the above expressions corrects for double counting of the same outcomes. For example, what is the probability of getting either an ace (event A) or a black card (event B) in one draw from a deck of cards? The events are independent (see Exercise 1.6), and therefore

P(A ∪ B) = P(A) + P(B) − P(A)P(B) = 4/52 + 26/52 − (4/52)(26/52) = 14/52 = 7/13

Note that the term P(A)P(B) subtracts out the probability for black aces. This probability has already been counted twice, once in the P(A) term and once in the P(B) term. When events A and B are mutually exclusive or disjoint, that is, both events cannot occur simultaneously, then P(AB) = 0, and

P(A ∪ B) = P(A) + P(B)

Furthermore, if both events are exhaustive in that at least one of them must occur when an experiment is run, then

P(A ∪ B) = P(A) + P(B) = 1
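The union formula can be checked by direct enumeration. The following Python sketch is an illustration only (not from the book); it lists a 52-card deck and confirms that P(ace or black card) works out to 7/13, both by counting and by the independent-events formula above.

```python
# Illustrative sketch (not from the book): verify P(ace or black) = 7/13
# by enumerating a 52-card deck.
from fractions import Fraction

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["spades", "clubs", "hearts", "diamonds"]    # first two suits are black
deck = [(rank, suit) for rank in ranks for suit in suits]

aces = [card for card in deck if card[0] == "A"]
blacks = [card for card in deck if card[1] in ("spades", "clubs")]
ace_or_black = [card for card in deck
                if card[0] == "A" or card[1] in ("spades", "clubs")]

p_union = Fraction(len(ace_or_black), len(deck))
p_formula = (Fraction(len(aces), 52) + Fraction(len(blacks), 52)
             - Fraction(len(aces), 52) * Fraction(len(blacks), 52))
print(p_union, p_formula)     # both print 7/13
```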

Thus, event A is the complement of event B. Event B can be viewed as the nonoccurrence of A and designated as event Ā. Hence, the probability of occurrence of any event is equal to one minus the probability of occurrence of its complementary event. This complement rule has important applications in reliability work because a component may either fail (event A) or survive (event Ā), resulting in

P(Failure) = 1 − P(Survival)

As another example, we note that the event "at least one occurrence" and the event "zero occurrences" are mutually exclusive and exhaustive events. Therefore, the probability of at least one occurrence is equal to 1 − probability of no occurrences.

An extension to three or more events is also possible. For three events A, B, and C, the formula is

P(A ∪ B ∪ C) = P(A) + P(B) + P(C) − P(AB) − P(BC) − P(AC) + P(ABC)

For independent events, the relation becomes

P(A ∪ B ∪ C) = P(A) + P(B) + P(C) − P(A)P(B) − P(B)P(C) − P(A)P(C) + P(A)P(B)P(C)

For mutually exclusive, exhaustive events, we have

P(A ∪ B ∪ C) = P(A) + P(B) + P(C) = 1

For four events, we begin by adding the four single-event probabilities. Then, we subtract the six possible probabilities of two events occurring simultaneously. Next, we add back in the four possible probabilities of three events occurring simultaneously. Finally, we subtract the probability of all four events occurring simultaneously. This "in and out" procedure works for any number of events, and the total number of terms in the final expression when there are n events will be 2^n − 1.

EXAMPLE 1.2  CONDITIONAL PROBABILITIES

A tricky word problem that appears often in many forms can be stated as follows: A computer hack visits the surplus store and sees two similar hard drives displayed. The sign says, "Specially Reduced: 50–50 Chance of Working." He asks the dealer whether the hard drives operate properly. The dealer replies "at least one of them is working." What is the probability that both drives are functional? Does the probability change if the dealer says "the one on the left works"?

Solution

The first question asks for the probability that both drives work, given that at least one is working, that is, P(both work | at least one works). Let A be the event "both drives work" and let B be the event "at least one drive works." We want the P(A|B). From our conditional probability formula, we can rewrite the expression as follows:

P(A|B) = P(AB)/P(B)

TABLE 1.4
Possible Outcomes for Drives

No Dealer Information        At Least One Works          Left Drive Works
Left Drive   Right Drive     Left Drive   Right Drive    Left Drive   Right Drive
W            W               W            W              W            W
W            N               W            N              W            N
N            W               N            W
N            N

Probability that both drives work:   1/4            1/3             1/2

For a set of events, E1 , E2 ,..., Ek, that are mutually exclusive and exhaustive, another useful relationship, sometimes called the law of total probabilities, applies. Any event A can be written as follows: k



P( A) = ∑ P( A E j )P(E j ) j=1

In words, P(A) is the weighted average of conditional probabilities, each weighted by the ­probability of the event on which it is conditioned. This expression is often easier to calculate than P(A) directly.

EXAMPLE 1.3  TOTAL PROBABILITIES

A computer manufacturer purchases equivalent microprocessor components from three different distributors. The assembly of each computer utilizes one microprocessor, randomly chosen from an in-house inventory. Typically, the inventory consists of 30% of this component type from distributor A, 50% from distributor B, and 20% from distributor C. Historical records show that components from distributors A and C are twice as likely to cause a system failure as those from distributor B. The probability of system failure with component B is 0.5%. What is the probability that a computer system will experience failure?

Solution

Since there are three distributors and we randomly chose a component from one of the distributors, we have three mutually exclusive and exhaustive events. The theorem of total probability is the basis for the solution:

P(failure) = P(failure | distributor A)P(distributor A) + P(failure | distributor B)P(distributor B) + P(failure | distributor C)P(distributor C)
           = 2(.005)(.3) + (.005)(.5) + 2(.005)(.2) = .0075 or .75%

A final key probability formula, known as Bayes' rule, allows us to "invert" conditional probabilities, that is, determine which one of the conditioning events Ej is likely to have occurred, given that event A has occurred. Again, for a set of mutually exclusive and exhaustive events, E1, E2, ..., Ek, Bayes' rule states that

P(Ej|A) = P(A|Ej)P(Ej) / Σ_{j=1}^{k} P(A|Ej)P(Ej)

Note that by the law of total probabilities, the denominator of this expression is just P(A).

EXAMPLE 1.4  BAYES' RULE

The probability that a batch of incoming material from any supplier is rejected is 0.1. Typically, material from supplier S1 is rejected 8% of the time, from supplier S2, 15%, and S3, 10%. We know that 50% of the incoming material comes from S1, 20% from supplier S2, and 30% from S3. Given that the latest lot of incoming material is rejected, what is the probability the supplier is S1?

Solution

Let A denote the event that the batch is rejected. Then, by Bayes' rule,

P(S1|A) = P(A|S1)P(S1)/P(A) = (0.08)(0.5)/[(0.08)(0.5) + (0.15)(0.2) + (0.1)(0.3)] = 0.4

In this example, the starting (i.e., before we know the batch is rejected) probability of the event S1 is 0.5. This knowledge is sometimes referred to as the "a priori" probability of S1. After the batch rejection, Bayes' rule allows us to calculate the new (conditional) probability of S1 as .4. The result is sometimes called the a posteriori probability of S1.
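Bayes' rule is easy to package as a small helper routine. The following Python sketch is an illustration only (not from the book; the function name bayes is ours). It reproduces the a posteriori probability of 0.4 for supplier S1, with the denominator computed via the law of total probabilities.

```python
# Illustrative sketch (not from the book): Bayes' rule as a helper function,
# applied to the supplier data of Example 1.4.
def bayes(prior, likelihood):
    """Return posterior probabilities P(Ej | A) given prior P(Ej) and
    conditional probabilities P(A | Ej) for mutually exclusive,
    exhaustive events Ej."""
    total = sum(p * l for p, l in zip(prior, likelihood))   # law of total probabilities: P(A)
    return [p * l / total for p, l in zip(prior, likelihood)]

prior = [0.5, 0.2, 0.3]            # P(S1), P(S2), P(S3)
reject_rate = [0.08, 0.15, 0.10]   # P(rejected | Si)

posterior = bayes(prior, reject_rate)
print(posterior[0])                # P(S1 | rejected) = 0.4
```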

EXAMPLE 1.5  BAYES' RULE APPLIED TO MISCLASSIFIED ITEMS

Assume we perform a test on a component to check for a specific defect. Historically, 1% of the components have this defect. Based on a detailed analysis of previous results, 95% of the components with the defect are detected, but 8% of the components without the defect are wrongly categorized as defective. If a component is classified as defective, what is the probability that the component actually has the defect? What is the probability that a component with a negative test has the defect?

Solution

We first solve this problem by an approach using simple average calculations. Consider a test of 2000 components. On the average, 1%, or 20, of the total components will have the defect. Of those with the defect, 95%, or 19, will be caught. However, of the 1980 without the defect, 8%, or 158, will have a false positive and be called defective. Therefore, 19 out of the 19 + 158 = 177 classified as defective will have the defect, and the probability of actually having the defect given a positive test result is 19/177 = 11%. This result shows that it may be a good idea to do a more extensive retest of rejected components and recover from the false positives. Also, the probability of having the defect given a negative test result is 1/(2000 − 177) = 1/1823 = .055%, which is about 1/18 of the prior probability.

Next, we get the same result using the Bayes' rule formula. Let A denote a positive test and E1 denote a defective unit. Also, let E2 represent a unit with no defect. We want

P(E1|A) = P(E1A)/P(A) = P(A|E1)P(E1)/[P(A|E1)P(E1) + P(A|E2)P(E2)]
        = .95(.01)/[.95(.01) + .08(.99)] = .0095/.0887 = 11%

Let B = negative result. The probability of having the defect E1 given a negative result B is

P(E1|B) = P(E1B)/P(B) = P(B|E1)P(E1)/[P(B|E1)P(E1) + P(B|E2)P(E2)]
        = .05(.01)/[.05(.01) + .92(.99)] = .0005/.9113 = .055%

EXERCISE 1.6
From a well-shuffled deck of cards, let drawing an ace be event A and let drawing a black card be event B. Determine P(AB), the probability of getting a black ace in a single draw, and show that events A and B are independent.

EXERCISE 1.7
Three assembly plants produce the same parts. Plant A produces 25% of the volume and has a shipment defect rate of 1%. Plant B produces 30% of the volume and ships 1.2% defectives. Plant C produces the remainder and ships 0.6% defectives. Given that a component picked at random from the warehouse stocked by these plants is defective, what are the probabilities that it was manufactured by plant A or B or C?

EXERCISE 1.8
An electronic card contains three components: A, B, and C. Component A has a probability of .02 of failing in 3 years. Component B has a probability of .01 of failing in 3 years and component C has a probability of .10 of failing in 3 years. What is the probability that the card survives 3 years without failing? What assumptions were made for this calculation?

1.6  Random Variables

In reliability studies, the outcome of an experiment may be numerical (e.g., time to failure of a component) or the result may be other than numerical (e.g., type of failure mode associated with a nonfunctional device). In either case, analysis is made possible by assigning a number to every point in the space of all possible outcomes, called the sample space. Examples of assigning numbers are as follows: the time to failure is assigned the elapsed hours of operation, and the failure mode may be assigned a category number 1, 2, and so on. Any rule for assigning a number creates a random variable. A random variable is a function for assigning real numbers to points in the sample space. The practice is to denote the random variable by a capital letter (X, Y, Z, etc.) and the realization of the random variable (i.e., the real number or piece of sample data) by the lowercase letter (x, y, z, etc.).

Since this definition appears a bit abstract, let us consider a simple example using a single die with six faces, each face having one to six dots. The experiment consists of rolling the die and observing the upside face. The random variable is denoted X, and it assigns numbers matching the number of dots on the side facing up. Thus, (X = x) is an event in the sample space, and X = 6 refers to the realization where the face with six dots is the side up. It is also common to refer to the probability of an event occurring using the notation P(X = x). In this example, we assume all six possible outcomes are equally likely (fair die), and therefore, P(X = x) = 1/6 for x = 1, 2, 3, 4, 5, or 6.

EXAMPLE 1.6  PROBABILITY EXPRESSION FOR CDF

The CDF F(x) can be defined as F(x) = P(X ≤ x), that is, F(x) is the probability that the random variable X has a value less than or equal to x. Similarly, the survival function can be defined as S(x) = 1 − F(x) = P(X > x).

1.7  Sample Estimates of Population Parameters

We have discussed descriptive techniques such as histograms to represent observations. However, in order to complement the visual impression given by the frequency histogram, we often employ numerical descriptive measures called parameters for a population and statistics for a sample. These measures summarize the data in a population or sample and also permit quantitative statistical analysis. In this way, the concepts of central tendency, spread, shape, symmetry, and so on take on quantifiable meanings. For example, we state that the frequency distribution is centered about a given value.

This central tendency can be expressed in several ways. One simple method is just to cite the most frequently occurring value, called the mode. For grouped data, the mode is the midpoint of the interval with the highest frequency. For the fuse data in Table 1.1, the mode is 5.05. Another procedure involves selecting the median, that is, the value that effectively divides the data in half. For individual readings, the n data points are first ranked in order, from smallest to largest, and the median is chosen according to the following algorithm: the middle value if n is odd, and the average of the two middle values if n is even. Alternatively, for data that has already been grouped or binned (Table 1.2), the median occurs in the interval for which the cumulative frequency distribution registers 50%; that is, a vertical line through the median divides the histogram into two equal areas. For grouped data with n points, to get the median, one first determines the class containing the middle measurement (the n/2-th point), the number of observations preceding that class, and the number of observations needed within the class to reach the middle measurement. For example, for the fuse data in Table 1.3, n = 100, and the middle value is the 50th point, which occurs in the class marked 4.895 to 4.995 (width 0.1). There are 41 data points before the interval and 13 points in this class. We must count 9/13 of the interval width to get to the median. Hence, the median is

4.895 + (9/13) × 0.1 = 4.964

(In reliability work, it is common terminology to refer to the median as the T50 value for time to 50% failures.)

The most common measure of central tendency, however, is called the arithmetic mean or average. The sample mean is simply the sum of the observations divided by the number of observations. Thus, the mean, denoted by X̄, of n readings is given by the statistic

X̄ = (X1 + X2 + X3 + ··· + Xn)/n = (1/n) Σ_{i=1}^{n} Xi

This expression is called a statistic because its value depends on the sample measurements. Thus, the sample mean will change with each sample drawn, which is another instance of the variability of the real world. In contrast, the population mean depends on the entire set of measurements, and thus it is a fixed quantity, which we call a parameter. The sample mean X̄ estimates the population mean μ. We also mention here a notation common in statistics and reliability work. A parameter estimate is commonly denoted by a caret (^) over the parameter symbol. Thus, μ̂ is an estimate of the population mean μ, and here μ̂ = X̄.

For a discrete (i.e., countable) population, the mean is just the summation over all discrete values, where each value xi is weighted by the probability of its occurrence pi:

μ = Σ_i xi pi

For a continuous population, the mean parameter is expressed in terms of the PDF model as

μ = ∫_{−∞}^{∞} x f(x) dx

For reliability work involving time, the population mean is

μ = ∫_{0}^{∞} t f(t) dt

An alternate expression for the mean of a lifetime distribution is sometimes easier to evaluate. The form of the equation, when a finite mean exists, is

μ = ∫_{0}^{∞} [1 − F(t)] dt

(See Feller 1968, page 148, for a proof.) A common practice in statistics is to refer to the mean for both discrete and continuous random variables as the expected value of the random variable and use the notation E(X) = μ or E(T) = μ. We occasionally use this terminology in this text.

Knowing the center of the distribution is not enough; we are also concerned about the spread of the data. The simplest concept for variability is the range, the difference between the highest and lowest readings. However, the range does not have very convenient statistical properties, and therefore, another measure of dispersion is more frequently used. This numerical measure of variation is called the variance. The variance has certain statistical properties that make it very useful for analysis and theoretical work. The variance of a random variable X is defined as the expected value of (X − μ)^2, that is, V(X) = E[(X − μ)^2]. An alternative formula is V(X) = E[X^2] − μ^2. For continuous data, the population variance for common reliability analysis involving time is

V(t) = σ^2 = ∫_{0}^{∞} (t − μ)^2 f(t) dt

In engineering terms, we see that the variance is the expected value of the second moment about the mean. The square root of the variance is called the standard deviation. The standard deviation is expressed in the same units as the observations. The sample standard deviation is denoted by s and the formula is

s = √[ Σ_{i=1}^{n} (Xi − X̄)^2 / (n − 1) ]

Since X̄ is used in the formula rather than the population mean, statistical theory shows that dividing by n − 1 gives a better (i.e., unbiased) estimate of the population variance (denoted by σ̂^2 = s^2) than just dividing by n. Alternatively, we may state that one degree of freedom has been taken to estimate the population mean μ using X̄.

We have defined numerical measures of central tendency (X̄, μ) and dispersion (s^2, σ^2). It is also valuable to have a measure of symmetry about the center and a measure of how peaked the data is over the central region. These measures are called skewness and kurtosis, and are respectively defined as expected values of the third and fourth moments about the mean, that is,

skewness: μ3 = E[(X − μ)^3];  kurtosis: μ4 = E[(X − μ)^4]

Symmetric distributions have skewness equal to zero. A unimodal (i.e., single peak) distribution with an extended right “tail” will have positive skewness and will be referred to as skewed right; skewed left implies a negative skewness and a corresponding extended left tail. For example, the exponential distribution in Figure 1.3 is skewed right. Kurtosis, on the other hand, indicates the relative flatness of the distribution or how “heavy” the tails are.

Both measures are usually expressed in relative (i.e., independent of the scale of measurement) terms by dividing μ3 by σ^3 and μ4 by σ^4. The kurtosis estimate is also offset by an amount that goes to three as the sample size increases so that data from a normal population has a kurtosis of approximately zero. Sample estimates are calculated using the formulas

Skewness estimate = [n/((n − 1)(n − 2))] Σ_{i=1}^{n} [(xi − x̄)/s]^3

Kurtosis estimate = [n(n + 1)/((n − 1)(n − 2)(n − 3))] Σ_{i=1}^{n} [(xi − x̄)/s]^4 − 3(n − 1)^2/((n − 2)(n − 3))

These formulas are used by the spreadsheet SKEW and KURT functions and also by Minitab and JMP in their descriptive statistics platforms. These various measures allow us to check the validity of the assumed model. Ott (1977) shows applications to the normal distribution. Table 1.5 contains a listing of properties of distributions frequently used in reliability studies.

The important statistical concept involved in sample estimates of population parameters (e.g., mean, variance, etc.) is that the population parameters are fixed quantities, and we infer what they are from the sample data. For example, the fixed constant θ in the exponential model F(t) = 1 − e^(−t/θ), where θ = 1/λ, can be shown to be the mean of the distribution of failure times for an exponential population. The sample quantities, on the other hand, are random statistics that may change with each sample drawn from the population.

TABLE 1.5
Properties of Distributions Used in Reliability Studies

Uniform: symmetric, not bell-shaped, not skewed (skew = 0), kurtosis −1.8; log data is not symmetric and bell-shaped; cumulative distribution shape: straight line.
Normal: symmetric, bell-shaped, not skewed (skew = 0), kurtosis 0; log data is not symmetric and bell-shaped; cumulative distribution shape: "S" shaped.
Exponential: not symmetric, not bell-shaped, skewed right (skew = 2), kurtosis 6; log data is not symmetric and bell-shaped; cumulative distribution shape: exponential curve.
Weibull: not symmetric, not bell-shaped, skewed right; log data is not symmetric and bell-shaped.
Lognormal: not symmetric, not bell-shaped, skewed right; log data is symmetric and bell-shaped.
Rayleigh (Weibull with shape parameter 2 and linear failure rate): not symmetric, not bell-shaped, skewed right (skew = 0.63), kurtosis 0.26; log data is not symmetric and bell-shaped.
Extreme Value: not symmetric, not bell-shaped, skewed left (skew = −1.14), kurtosis 2.4; log data is not symmetric and bell-shaped.
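Before turning to specific distributions, the sample statistics of this section can also be computed in a few lines of Python. The sketch below is an illustration only (not from the book); it uses the skewness and kurtosis conventions quoted above (the spreadsheet SKEW/KURT forms), and the data shown is just the first row of Table 1.1, used as a stand-in.

```python
# Illustrative sketch (not from the book): sample statistics using the
# formulas quoted in the text (spreadsheet SKEW/KURT conventions).
from statistics import mean, median, stdev

data = [4.64, 4.98, 4.73, 5.21, 5.06, 5.04, 5.03, 5.46, 5.01, 4.77]  # stand-in sample

n = len(data)
xbar = mean(data)
s = stdev(data)          # sample standard deviation (divides by n - 1)

z3 = sum(((x - xbar) / s) ** 3 for x in data)
z4 = sum(((x - xbar) / s) ** 4 for x in data)

skew = n / ((n - 1) * (n - 2)) * z3
kurt = (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))) * z4 \
       - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))

print(f"mean={xbar:.3f} median={median(data):.3f} s={s:.3f} "
      f"skewness={skew:.3f} kurtosis={kurt:.3f}")
```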

EXAMPLE 1.7  THE UNIFORM DISTRIBUTION

The uniform distribution is a continuous distribution with PDF for the random variable T given by

f(t) = 1/(θ2 − θ1),  θ1 ≤ t ≤ θ2

and zero elsewhere, where θ1 and θ2 are the parameters specifying the range of T. The rectangular shape of this distribution is shown in Figure 1.7. We note that f(t) is constant between θ1 and θ2. The CDF of T, denoted by F(t), for the uniform case is given by

F(t) = (t − θ1)/(θ2 − θ1)

Thus, F(t) is linear in t in the range θ1 ≤ t ≤ θ2, as shown in Figure 1.8.

FIGURE 1.7 The uniform PDF.

FIGURE 1.8 The CDF for the uniform distribution.

EXERCISE 1.9
Show that the uniform distribution has expected value E(t) = (θ1 + θ2)/2 and variance V(t) = (θ2 − θ1)^2/12.

EXERCISE 1.10
The uniform distribution defined on the unit interval [0,1] is a popular and useful model, so much so that the name uniform distribution is often taken to refer to this special case. Find f(u), F(u), E(u), and V(u) for this distribution.

EXERCISE 1.11
Let F(t) = 1 − (1 + t)^(−1), 0 ≤ t ≤ ∞. This is a legitimate CDF that goes from 0 to 1 continuously as t goes from 0 to ∞. Find the PDF and the T50 for this distribution. Try to calculate the mean. (Hint: Use either integration by parts or the alternate formula given in the text for calculating the mean.)

EXAMPLE 1.8  THE BETA DISTRIBUTION

The (standard) beta distribution, like the uniform distribution discussed in Exercise 1.10, is also defined on the unit interval [0,1]. However, it is a far more flexible distribution and even includes the uniform distribution as a special case. Its flexibility is one of the reasons it is an excellent choice for modeling numbers between 0 and 1, such as probabilities or proportions. For a random variable X having a beta distribution with parameters a > 0 and b > 0, the PDF in the unit interval is given by

f(x) = x^(a−1)(1 − x)^(b−1) / B(a, b)

where B(a,b), in terms of gamma functions (see the discussion in Chapter 4 after Exercise 4.6), is

B(a, b) = Γ(a)Γ(b)/Γ(a + b)

For a and b integers,

B(a, b) = (a − 1)!(b − 1)! / (a + b − 1)!

where we use the factorial notation a! to represent a(a − 1)(a − 2)···1. Note that when a = b = 1, the beta distribution is the same as the uniform distribution on [0,1]. The CDF of the beta distribution is commonly called the incomplete beta function. For any 0 < x < 1, the incomplete beta function F(x) is given by

F(x) = ∫_{0}^{x} [y^(a−1)(1 − y)^(b−1) / B(a, b)] dy = I_x(a, b)

Applied Reliability

3

a = 6, b = 2

a = 2, b = 6

2.5

a=b=4

f(x)

2

a=1

, b=

2

a=b=2

1.5

, b=

a=2

1

a=b=1

1 0.5 0

0

0.25

0.5 x

0.75

1

FIGURE 1.9 Beta density functions.

1.8  How to Use Descriptive Statistics At this point, it is important to emphasize some considerations for the analyst. No matter what summary tools or computer programs are available, the researcher should always “look” at the data, preferably in several ways. For example, many data sets can have the same mean and standard deviation and still be very different—and that difference may be of critical significance (see Figure 1.10 for an illustration of this effect). Generally, the analyst will start out with an underlying model in his mind based on the type of data, where the observations came from, previous experience, familiarity with probability models, and so on. However, after obtaining the data, it is necessary that the analyst go through a verification stage before he blindly plunges ahead with his model. This requirement is where the tools of descriptive statistics are very useful. Indeed, in many cases we utilize descriptive statistics to help us choose an appropriate model right at the start of our studies. Other useful graphical techniques include Boxplots, dot plots, stem and leaf plots, 3D plots, and so on (see Chambers et al. 1983 for further information on graphical analysis). In this text, we focus on several key continuous distributions that are most applicable to reliability analysis: the exponential, Weibull, normal, and lognormal distributions. By learning what these distributions should look like, we can develop a yardstick to measure our data for appropriateness to some model. Graphics (frequency histograms, cumulative frequency curves) and summary values (mean, median, variance, skewness, etc.) are the means by which the characteristics of distributions are understood. In Chapter 6, we shall introduce other valuable descriptive procedures such as probability plotting.

23

Basic Descriptive Statistics

0.5

1

0.4

0.8

0.3

0.6

0.2

0.4

0.1

0.2

0

−1

0

1

2

3

0

0

2

4

6

0.5

1.4 1.2

0.4

1 0.3

0.8 0.6

0.2

0.4

0.1

0.2 0

0

2

4

6

0

−4

−2

0

2

4

6

FIGURE 1.10 Mean and Sigma do not tell us enough. These four distributions have the same mean and standard deviation.

1.9  Data Simulation

Many different PDFs (and CDFs) exist, and reliability studies are often concerned with determining what model is most appropriate for the analysis. In reliability work, one may wish to simulate data from various distributions in order to do the following:

1. Determine percentiles of complicated distributions that are functions of common distributions.
2. Evaluate the effectiveness of different techniques and procedures for analyzing sample data.
3. Test the potential effectiveness of various experimental designs and sample size selections.
4. Illustrate statistical concepts, especially to understand the effects of variability in data.

Computer programs that will generate random variables from almost any desired distribution are available. However, there is a simple and general technique that allows us to produce what are called pseudorandom numbers from many common distributions. (The term pseudorandom is used because a specific computer algorithm generates the numbers to be as nearly random as possible.) To begin, we need a good table of random numbers or we can use a spreadsheet function to generate random numbers.

For simplicity, we consider only distribution functions F(x) that are continuous and map one-to-one onto the unit interval (0,1), that is, 0 ≤ F(x) ≤ 1. This class includes all the life distributions discussed in this text. Let F(x) = u. Then, we can define an inverse function F⁻¹(u) = x that provides the specific percentile corresponding to the CDF value in the unit interval. For example, given F(x) = .5, then F⁻¹(.5) = the median, which is the 50th percentile. F and its inverse have the following properties: F(F⁻¹(u)) = u and F⁻¹(F(x)) = x.

To generate a random sample x1, x2, ..., xn from F(x), first generate a random sample u1, u2, ..., un from the uniform distribution defined on [0,1]. This procedure is done with random numbers. For example, if a five-digit random number is obtained from a table or a spreadsheet, divide the number by 100,000 to obtain a pseudorandom number from the uniform distribution. (The spreadsheet function RAND() provides random numbers directly in the unit interval.) Next, set x1 = F⁻¹(u1), x2 = F⁻¹(u2), ..., xn = F⁻¹(un). It is easy to show that the sample of x's is distributed according to the F(x) distribution. (See the hint to Problem 1.4 at the end of this chapter.)

EXAMPLE 1.9  DATA SIMULATION

Let F(t) be the distribution given in Exercise 1.11. Generate a sample of five random times from this distribution.

Solution

We obtain F⁻¹ by solving for t in F(t) = u = 1 − (1 + t)^(−1) to get t = u/(1 − u) = F⁻¹(u). Next, we use a random number generator via a spreadsheet function to obtain the uniform distribution sample (0.880, 0.114, 0.137, 0.545, 0.749). Transforming each of these by F⁻¹ gives the values t1 = 0.880/(1 − 0.880) = 7.333, t2 = 0.129, t3 = 0.159, t4 = 1.198, and t5 = 2.984. The sample (t1, t2, t3, t4, t5) is the desired random sample from F.
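The same inverse-CDF transformation can be scripted directly. The following Python sketch is an illustration only (not from the book); it draws five uniform pseudorandom numbers and applies F⁻¹(u) = u/(1 − u), and it also transforms the specific uniform values quoted in the example, reproducing the five times reported there. The seed is arbitrary.

```python
# Illustrative sketch (not from the book): inverse-CDF simulation of
# Example 1.9, where F(t) = 1 - 1/(1 + t) and F^{-1}(u) = u/(1 - u).
import random

random.seed(7)                       # arbitrary seed
uniform_sample = [random.random() for _ in range(5)]
times = [u / (1.0 - u) for u in uniform_sample]
print(times)

# Using the uniform values quoted in the example gives the same five times:
quoted_u = [0.880, 0.114, 0.137, 0.545, 0.749]
print([round(u / (1 - u), 3) for u in quoted_u])   # [7.333, 0.129, 0.159, 1.198, 2.984]
```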

In a typical reliability experiment, n units are placed on stress, and the exact times to failure are recorded. The successive failure times naturally occur in increasing order, that is, the first failure time is less than the second failure time, the second is less than the third, and so on. This property of ordered failure times is a key characteristic of reliability work. In contrast, consider selecting n individuals randomly and measuring, for example, their weight or height. The successive observations will not necessarily occur in increasing order. Consequently, in simulating random variables for reliability studies, one would like the values arranged in increasing order.

For a single set of simulated observations, one could do a manual sort using the spreadsheet sort routine available under the menu item Data. However, for repeated simulations (involved in Monte Carlo studies), a nonmanual procedure is desirable. In Excel (and OpenOffice), the spreadsheet function PERCENTILE(array, k) can be used. This function returns the kth percentile (where 0 ≤ k ≤ 1) of values in the range defined by the array. The trick is to choose the k values to be the (i − 1) multiples of 1/(n − 1), where i = 1, 2, ..., n is the failure count and n is the sample size. We illustrate the procedure in Example 1.10.


FIGURE 1.11 Simulating ordered random variables.

EXAMPLE 1.10  DATA SIMULATION

Let F(t) be a uniform distribution defined on the unit interval, that is, 0 ≤ t ≤ 1. Set up a spreadsheet to generate repeatedly a sample of five ordered random times from this distribution.

Solution

We construct a spreadsheet as shown in Figure 1.11. The first column, labeled "Failure Count i," is the index for the occurrence order: i = 1, 2, 3, 4, and 5. The second column, "Random Times," contains the spreadsheet function = RAND() in each cell to generate a random uniform variable. The third column, "Ordered Times," in cell C2 has the expression = PERCENTILE($B2:$B6,(A2−1)/(5−1)). We use the autofill to complete the remaining cells in the third column. Now, whenever the spreadsheet recalculates a new set of random numbers in the second column (say by pressing the F9 key on a keyboard), the third column will contain those random numbers in increasing order.

More examples of data simulation will be given in Chapters 3, 4, 5, 11, 12, and 13. Many of the examples worked out in this text will contain sets of data simulated from key distributions. We shall use such simulated data to illustrate our procedures and methods.

EXERCISE 1.13
Let F(t) be a uniform distribution defined on the unit interval, that is, 0 ≤ t ≤ 1. Set up a spreadsheet to generate repeatedly a sample of 10 ordered random times from this distribution. Repeat the simulation 10 times and calculate the average of the times to the first failure, the second failure, and so on, up to the tenth failure. How do you keep track of the results from each repeated simulation? Plot the average values versus the failure count. Comment on the results. You have just completed a simple Monte Carlo study.
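Outside a spreadsheet, the ordering step is simply a sort. The sketch below is illustrative only (not from the book); it repeats the simulation many times and averages each ordered value, in the spirit of Exercise 1.13. The seed, sample size, and number of repetitions are arbitrary choices.

```python
# Illustrative sketch (not from the book): generate ordered uniform samples
# repeatedly and average each order statistic.
import random

random.seed(11)                    # arbitrary seed
n, repetitions = 10, 1000
sums = [0.0] * n

for _ in range(repetitions):
    ordered = sorted(random.random() for _ in range(n))
    for i, value in enumerate(ordered):
        sums[i] += value

averages = [total / repetitions for total in sums]
print([round(a, 3) for a in averages])   # estimates of the expected ordered values
```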

1.10  Summary
In this chapter, we have introduced descriptive statistical techniques, including histograms and cumulative frequency curves. We have discussed populations and samples and illustrated simple concepts of probability with examples. We have covered numerical measures of central tendency and dispersion, while emphasizing the importance of visualizing the observations. We have mentioned several important reliability distribution models and described how to simulate data from a distribution. An in-depth presentation of the applications and uses of these concepts in the study of reliability will be provided in Chapter 2.

Appendix 1A
1.1A  Creating a Step Chart in a Spreadsheet
We mentioned in Section 1.4 that the EDF is a step function increasing by 1/n at each successive failure time. Unfortunately, there is no direct graphical routine in Excel or OpenOffice for creating step plots. However, with a few simple manipulations, it is possible to create a step plot. Consider a reliability experiment in which five units were stressed and the times of failures were recorded. The data is shown in Figure 1.1A. To create a step chart in an Excel or OpenOffice spreadsheet, we create a separate table of two columns labeled X and Y, respectively, derived from the first table. In the first row of each column, we enter zero. Then, in the second and third rows of the first column, we enter the first failure time. In the next two rows, we enter the second failure time, and so on, until all failure times have been entered twice in the first column. In the second row of the second column, we enter zero, and then in the third and fourth rows, we enter the EDF estimate at the first failure time. In the next two rows, we enter the EDF estimate for the second failure time, and so on, until all EDF estimates except the last have been entered twice in the second column. The last EDF estimate of 1 is entered in the final row of the second column. The result is shown in Figure 1.2A.
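The doubling construction just described is easy to automate. The following Python sketch builds the X and Y columns for an arbitrary set of failure times; the five failure times used here are hypothetical, not the ones in Figure 1.1A.

# Build the doubled (X, Y) table for an EDF step chart, as described above.
def edf_step_table(failure_times):
    """Return X and Y lists that trace the EDF as a step function."""
    times = sorted(failure_times)
    n = len(times)
    x = [0.0]
    y = [0.0]
    for i, t in enumerate(times, start=1):
        x.extend([t, t])                  # each failure time entered twice
        y.extend([(i - 1) / n, i / n])    # EDF just before and at the jump
    return x, y

# Hypothetical failure times for five units (hours).
x, y = edf_step_table([25, 43, 60, 78, 95])
print(list(zip(x, y)))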

FIGURE 1.1A Spreadsheet table for experiment.

FIGURE 1.2A Derived spreadsheet table for step chart.

FIGURE 1.3A Step chart (EDF versus failure times).

Using this expanded table and the X−Y scatter plot capability of the spreadsheet, we can produce, after some edits and labeling to improve appearance, the step chart shown in Figure 1.3A. In the reliability literature, step charts appear in different ways. For example, some authors remove vertical lines, use dashed instead of solid lines, add symbols, and so on. Spreadsheet graphical routines allow considerable flexibility in achieving the desired representation.

Problems

1.1 The following American experience mortality table gives the proportion living as a function of age, starting from age 10 in increments of 10 years.

AGE:     10     20      30      40      50      60      70      80      90      100
LIVING:  1.00   0.926   0.854   0.781   0.698   0.579   0.386   0.145   0.008   0.000

a. Calculate the percentage dying in each 10-year interval and plot the histogram.
b. Calculate the average life span of 10-year-olds. Use the midpoints of the interval as the age at death.
1.2 For the electronic card assembly in Exercise 1.8, component C (with a probability of failing of 0.10) clearly was the major source of failures. If a second component C were added to the card in parallel with the first component C, so that both had to fail in order for the card to fail, what would the probability of the card lasting 3 years now be? Note that the card survives if both A and B survive and at least one of the two C components survives.


1.3 Show that Bayes' rule follows almost directly from the definition of conditional probability and the law of total probability.
1.4 Show that a sample generated using the general simulation method given in this chapter does, in fact, have the distribution F(x). Hint: Let X = F^−1(U) be the random variable generated by applying F^−1 to a uniform random variable. Now use P(X ≤ x) = P[F(X) ≤ F(x)] and the properties of inverse distributions, combined with the uniform distribution property that P(U ≤ u) = u for 0 ≤ u ≤ 1, to show that P(X ≤ x) = F(x).
1.5 Each column in the table below contains 25 data points simulated from one of the distributions whose shape, skewness, and kurtosis are described in Table 1.5. For each of the five sets of data, construct a histogram and calculate the mean, standard deviation, skewness, and kurtosis. Then try to identify the type of distribution from which each sample came. See if looking at the natural log of the data helps when it is difficult to decide between a lognormal model and a Weibull model. You can use either spreadsheet functions, JMP, or Minitab for these calculations.

DATA1: 3565 2713 1903 2694 4138 4395 4853 3531 2196 6195 2626 4632 2334 636 1366 3423 5610 1925 1080 1178 1646 5000 2273 1047 3174
DATA2: 3752 2294 2804 3572 3573 3220 3404 3640 4494 3556 3008 3643 2825 1984 3295 2597 2809 3950 3691 3632 3220 2054 3407 3323 2550
DATA3: 4485 6248 1752 5992 1202 1159 4102 6024 7808 21335 7861 81 6715 946 9483 4385 2424 8247 897 8793 58 6466 6923 10885 7222
DATA4: 2673 2386 1900 2385 2112 2511 2679 6721 2841 4827 4238 2368 5801 2064 3177 3047 2889 4176 2353 5528 5351 3370 1538 1802 3033
DATA5: 628 9116 27475 2793 9384 1936 2971 4870 12547 6782 5991 4457 3787 6163 11175 44581 6800 12906 1087 1051 3044 689 736 22 4126

2
Reliability Concepts
This chapter introduces the terms and concepts needed to describe and evaluate product reliability. These include the reliability or survival function; the hazard and cumulative hazard functions; the failure rate and average failure rate (AFR); the renewal rate; the mean time to failure and the residual mean time to failure; and the well-known bathtub curve. In addition, we will look at the kinds of data that a reliability analyst typically obtains from laboratory testing or a customer environment (uncensored, right censored, and multicensored data). Less common kinds of data (left censored and truncated) are included for completeness.

2.1  Reliability Function The theoretical population models used to describe device lifetimes are known as “life distributions.” For example, if we are interested in a particular type of microprocessor, then the population might be all the lifetimes obtainable from microprocessors of this type. Alternatively, we might want to restrict our population to just microprocessors from a specific manufacturer made during a set time period. In either case, the cumulative distribution function (CDF) for the population is called a life distribution. If we denote this CDF by F(t), then F(t) has two useful interpretations:

1. F(t) is the probability that a random unit drawn from the population fails by t hours.
2. F(t) is the fraction of all units in the population that fail by t hours.

Pictorially, F(t) is the area under the probability density function f(t) to the left of t. This area is shown in Figure 2.1. The total area under f(t) is unity (i.e., the probability of failure approaches one as t approaches infinity). Since F(t) is a probability, the shaded region has an area equal to the probability of a new unit failing by t hours of operation. This equivalence of area to probability generalizes such that the area under f(t) between two vertical lines drawn at time t1 and a later time t2 corresponds to the probability of a new unit surviving to time t1 and then failing in the interval between t1 and t2. This area can be obtained by taking all the area to the left of t2 and subtracting the area to the left of t1, which is just F(t2) − F(t1). F(t2) − F(t1) is the probability that a new unit survives to time t1 but fails before time t2. F(t2) − F(t1) is also the fraction of the entire population that fails in that interval. Since it is often useful to focus attention on the unfailed units, or survivors, we define the reliability function by

R(t) = 1 − F(t)


FIGURE 2.1 Cumulative distribution function.

(We could also call F(t) the “unreliability” function.) In medical literature, the equivalent term survival function is often used and denoted as S(t). The reliability function may be thought of in either of two ways:

1. As the probability that a random unit drawn from the population is still operating after t hours
2. As the fraction of all units in the population that will survive at least t hours

If n identical units are operating and F(t) describes the population they come from, then nF(t) is the expected (or average) number of failures up to time t, and nR(t) is the number expected to still be operating.

EXAMPLE 2.1  LIFE DISTRIBUTION CALCULATIONS Suppose that a population of components is described by the life distribution F(t) = 1 − (1 + 0.001t)^−1. What is the probability that a new unit will fail by 1000 hours? By 4000 hours? Between 1000 and 4000 hours? What proportion of these components will last more than 9000 hours? If we use 150 of them, how many do we expect to fail in the first 1000 hours? In the next 3000 hours?

Solution By substitution, F(1000) = 1 − (1 + 1)^−1 = 0.5, and F(4000) = 1 − (1 + 4)^−1 = 0.8. These are the probabilities of failing in the first 1000 and first 4000 hours, respectively. The probability of failing between 1000 and 4000 hours is F(4000) − F(1000) or .3. The proportion surviving past 9000 hours, R(9000), is (1 + 9)^−1, or .1. Finally, the expected number of failures in the first 1000 hours is 150 × 0.5, or 75. In the next 3000 hours, an additional 150 × 0.3, or 45 units, are expected to fail.

EXERCISE 2.1
Suppose that a population of components follows the life-distribution model described as follows:



F(t) = 1 − e^−(t/2000)^0.5




a. What is the probability that a new unit will fail by 500 hours? By 3000 hours?
b. Between 500 and 3000 hours?
c. What proportion of new units will last more than 5000 hours?
d. If we use 200 of them, how many do we expect to fail in the first 500 hours? In the next 2500 hours?

2.2  Some Important Probabilities
Because F(t) and R(t) are probabilities, a few simple but powerful formulas can be derived easily using the basic rules for calculating the probabilities of events presented in Chapter 1. The two rules needed in this section are as follows:

1. Multiplication Rule: The probability that several independent events will all occur is the product of the individual event probabilities.
2. Complement Rule: The probability that an event does not occur is 1 minus the probability of the event.

In our application, the independent events are the failure or survival of each of n randomly chosen, independently operating units. If we want the probability of all of them still operating after t hours, we apply the multiplication rule and form the product of n R(t) terms. In other words, the probability that n independent identical units, each with reliability R(t), all survive past t hours is [R(t)]n. The probability that at least one of the n units fails is one minus the probability that all survive. Applying the complement rule, the probability that at least one of n independent identical units fails by time t is given by

1 − [R(t)]^n = 1 − [1 − F(t)]^n

The power of these formulas is readily apparent if we consider a simple system composed of n identical components all operating independently (in terms of working or failing). If the life distribution for each of these components is F(t), then the probability that the system does not have a failure by time t is [R(t)]^n. If the system fails when the first of its components fails, and we denote the life-distribution function for a population of these systems by Fs(t), then the complement rule gives us

Fs(t) = 1 − [R(t)]^n

This equation shows how, in this simple case, system reliability is built up using a bottom-up approach starting with the individual component reliabilities. This is a key concept, which will be discussed in detail in Chapter 10.
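As a quick illustration of this bottom-up formula, the sketch below evaluates Fs(t) = 1 − [R(t)]^n in Python; the 0.99 component reliability and n = 12 are the O-ring figures discussed in Example 2.2 that follows, and the function name is ours.

# Series-system sketch: n independent identical components, each with
# reliability R(t); the system survives only if every component survives.
def system_cdf(component_reliability, n):
    """Fs(t) = 1 - [R(t)]^n for a system that fails at its first component failure."""
    return 1.0 - component_reliability ** n

# Challenger O-ring illustration (see Example 2.2): 12 O-ring seals,
# each assumed 0.99 reliable under freezing conditions.
print(round(system_cdf(0.99, 12), 2))   # about 0.11, i.e., roughly 1 chance in 9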


EXAMPLE 2.2  SYSTEM RELIABILITY A tragic illustration of how a system's reliability can be much worse than any of its components is given by the 1986 Challenger space shuttle disaster. The seal failure of the O-rings under icy conditions contributed to the shuttle failure. Each of the two solid rocket boosters consisted of four sections joined by pairs of O-rings (a total of 12 O-rings), and all of them had to work properly. Assuming independent reliability under freezing conditions for a single O-ring at 0.99, the system reliability, for just this part, is 0.99^12 = 0.89. Therefore, the potential for system failure is 0.11, or roughly 1 in 9. The Challenger mishap occurred on the tenth launch of this shuttle.

EXERCISE 2.2
Assume a computer contains 1000 identical components, all operating independently and each critical to the operation of the computer. If each component has a reliability estimated at 0.9998 for the warranty period, what is the probability the computer fails while under warranty?

We turn now from the probabilities of failure to the various ways of defining the rates at which failures occur.

2.3  Hazard Function or Failure Rate Consider a population of 1000 units that start operating at time zero. Over time, the units fail one by one. Say that at 5000 hours, with four units having already failed, another unit fails in the next hour. How would we define a “rate of failure” for the units operating in the hour between 5000 and 5001? Since 996 units were operating at the start of that hour, and one failed, a natural estimate of the failure rate (for units at 5000 hours of age) would be 1/996 per hour. If we look closely at that calculation, we see that we have calculated a conditional rate of failure, or rate of failure for the survivors at time t = 5000 hours. We can make this definition more precise by using the concept of conditional probability discussed in Chapter 1. There, we used the notation P(A) to denote the probability of event A occurring and P(B|A) to denote the conditional probability that event B will occur, given that A is known to have occurred. Moreover, P(B|A) was defined as follows:

P(B|A) = P(B and A both occur)/P(A)

Using this formula, we can calculate the probability of failing in a small interval of time, Δt, after surviving up to time t, as follows:

P(fail in next Δt | survive to time t) = [F(t + Δt) − F(t)]/R(t)

We divide this by Δt to convert it to a rate and obtain

P(fail in next Δt | survive to time t)/Δt = [F(t + Δt) − F(t)]/[R(t)Δt]


If we now let Δt approach zero, we obtain the derivative of F(t), denoted by F′(t), divided by R(t). Since F′(t) = f(t), we have derived the expression for the instantaneous failure rate or hazard rate h(t):

h(t) = f(t)/R(t)

For the remainder of the text, the terms failure rate, instantaneous failure rate, and hazard rate will all be equivalent and have the above definition. The units for the rate we have just defined are failure proportions per unit time. It is the failure rate of the survivors to time t in the very next instant following t. It is not a probability and can have values greater than 1 (although it is always nonnegative). In general, the rate is a function of t and is not a single number or constant. As we shall see in Section 2.4, any nonnegative function defined everywhere on the time axis can be a legitimate hazard function as long as it has an integral from 0 to ∞ that is infinite. The reader should be cautioned that not all authors use the same definition when talking about failure rates. Some authors define the failure rate to be f(t), which is the rate of failure of the original time-zero population at time t.

2.4  Cumulative Hazard Function
Just as the probability density function f(t) can be integrated to obtain the CDF F(t), the hazard function h(t) can be integrated to obtain the cumulative hazard function H(t):

H(t) = ∫_0^t h(y)dy

Using h(t) = f(t)/R(t), we can express this integral in closed form as

H (t) = −ln R(t)

where the notation “ln” denotes natural logarithms or logarithms to the base e. EXERCISE 2.3 Verify that the closed-form expression for H(t) is correct by taking derivatives of both sides and obtaining the definition of the hazard function h(t).

By taking antilogarithms in the above equation for H(t), a well-known and useful identity relating failure rates and CDFs is obtained as follows:

F(t) = 1 − e^−H(t) = 1 − e^(−∫_0^t h(y)dy)

This expression shows that given H(t), we can calculate F(t), and vice versa. So, in a sense, all the quantities we have defined give the same amount of information: with any one of F(t), f(t), h(t), or H(t), we can calculate all of the others. Also note that H(t) has to go to infinity as t goes to infinity; otherwise, F(t) would not approach 1 toward the end of life, as is required. H(t) is useful in graphical plotting for estimating life-distribution parameters from failure data (Nelson 1969, 1972).


EXERCISE 2.4
Derive the equation for F(t) given that h(t) is the constant λ, that is, h(t) = λ for all t.

EXERCISE 2.5
Which of the following could be a CDF, probability density function (PDF), or hazard function for a life-distribution function defined on (0, ∞)?
a. e^ct, c > 0
b. e^−ct, c > 0
c. At^5, A > 0
d. At^−3, A > 0
e. ke^bt/t^1.5, k > 0; −∞ < b < ∞

2.5  Average Failure Rate
Since the failure rate h(t) varies over time, it is useful to define a single average number that typifies failure-rate behavior over an interval. This number might be used in an engineering specification for a component or it might be an input to service cost and stock-replacement calculations. A natural way to define an AFR between time t1 and t2 is to integrate the (instantaneous) failure rate over this interval and divide by t2 − t1:

AFR(t1, t2) = [∫_t1^t2 h(t)dt]/(t2 − t1) = [H(t2) − H(t1)]/(t2 − t1) = [ln R(t1) − ln R(t2)]/(t2 − t1)

If the time interval is from 0 to T, the AFR simplifies to

AFR(T) = H(T)/T = −ln R(T)/T

and this quantity is approximately equal to F(T)/T for small F(T), that is, F(T) less than about 0.10. Conversely, in the time interval 0 to T, if we know the AFR(T), the CDF F(T) for any distribution is given exactly by the relation

F(T) = 1 − e^(−T × AFR(T))

which is approximately equal to T × AFR(T ) for small F(T).


AFR(T) finds frequent use as a single-number specification for the overall failure rate of a component that will operate for T hours of useful life. For example, if the desired lifetime is 40,000 hours, then AFR(40,000) is the single average lifetime failure rate. It should be noted that the AFR is not generally defined or used in most of the literature on reliability, despite its usefulness. Furthermore, the acronym AFR itself is sometimes used as the reliability term to indicate “annual or annualized failure rate.”
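A short sketch may help make the AFR formulas concrete. The Python fragment below (ours, not the text's) computes AFR(t1, t2) from any reliability function; the illustration assumes R(t) = (1 + 0.001t)^−1, the life distribution of Example 2.1 (used again in Example 2.3 in the next section), so the second value can be checked against that example.

# AFR between t1 and t2 computed from the reliability function:
# AFR(t1, t2) = [ln R(t1) - ln R(t2)] / (t2 - t1).
import math

def afr(reliability, t1, t2):
    return (math.log(reliability(t1)) - math.log(reliability(t2))) / (t2 - t1)

# Illustration with F(t) = 1 - (1 + 0.001t)^-1, i.e., R(t) = (1 + 0.001t)^-1.
R = lambda t: 1.0 / (1.0 + 0.001 * t)

print(afr(R, 0, 40_000))        # average failure rate over a 40,000-hour life
print(afr(R, 1_000, 10_000))    # about 0.00019, matching Example 2.3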

2.6  Units
Failure rates for components are often so small that units of failures per hour would not be appropriate. Instead, the scale most often used for failure rates is percent per thousand hours (%/K). 1%/K would mean an expected rate of 1 fail for each hundred units operating for 1000 hours. Another scale that is popular for highly reliable components is parts per million per thousand hours (PPM/K); thus, 1 PPM/K means one failure is expected out of one million components operating for 1000 hours. Another name for PPM/K is FIT for fails in time (other authors have stated that the name FIT comes from failure unit). The equivalent definition of a FIT is a failure per 10^9 device hours. This name and definition will be used throughout this text. The factors used to convert h(t) and AFR to %/K or FIT are given below:

Failure rate in %/K = 10^5 × h(t)
AFR in %/K = 10^5 × AFR(T1, T2)
Failure rate in FITs = 10^9 × h(t) = 10^4 × failure rate in %/K
AFR in FITs = 10^9 × AFR(T1, T2)

EXAMPLE 2.3  FAILURE RATE CALCULATIONS For the life distribution F(t) = 1 − (1 + 0.001t)^−1, derive h(t) and calculate the failure rate at 10, 100, 1,000, and 10,000 hours. Give the last failure rate in both %/K and FIT. What is AFR(1000)? What is the AFR between 1,000 and 10,000 hours? If five components, each having this life distribution, are starting operation, what is the probability they will experience no failures in the first 1,000 hours?

Solution F(t) = 1 − (1 + 0.001t)^−1 and, by taking the derivative, f(t) = 0.001(1 + 0.001t)^−2. By definition,

h(t) = f(t)/(1 − F(t)) = 0.001(1 + 0.001t)^−1

By substitution, h(10) = 0.001/1.01 = 0.00099. Similarly, h(100) = 0.001/1.1 = 0.00091, h(1000) = 0.0005, and h(10,000) = 0.000091. The last failure rate is 9.1%/K and 91,000 FITs. AFR(1000) = −ln R(1000)/1000 = −ln[(1 + 1)^−1]/1000 = 0.0007. AFR(1000, 10000) = [ln 2^−1 − ln 11^−1]/9000 = 0.00019. The probability that five components operate for 1000 hours without any failure is R(1000)^5 by applying the multiplication rule. This expression yields [(1 + 1)^−1]^5 = 2^−5 = 1/32.
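The arithmetic of Example 2.3, including the unit conversions of Section 2.6, can be reproduced with a few lines of Python; this is only a sketch, and the helper names are ours.

# Hazard rate and unit conversions for Example 2.3:
# F(t) = 1 - (1 + 0.001t)^-1, so h(t) = 0.001/(1 + 0.001t) failures per hour.
def h(t):
    return 0.001 / (1.0 + 0.001 * t)

def to_percent_per_k(rate_per_hour):
    return rate_per_hour * 1e5      # %/K = failures/hour x 10^5

def to_fits(rate_per_hour):
    return rate_per_hour * 1e9      # FIT = failures/hour x 10^9

for t in (10, 100, 1_000, 10_000):
    rate = h(t)
    print(t, round(rate, 6), round(to_percent_per_k(rate), 2), round(to_fits(rate)))
# The 10,000-hour line reproduces the 9.1%/K and roughly 91,000 FIT quoted above.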


EXERCISE 2.6 Using the life distribution from Exercise 2.1, calculate the failure rate at 10, 100, 1,000, 10,000, and 100,000 hours. Give the answers in both %/K and FITs. EXERCISE 2.7 What is the AFR for the life distribution in Example 2.3, between 500 and 3000 hours, expressed in %/K?

2.7  Bathtub Curve for Failure Rates Any nonnegative function h(t) whose integral H(t) approaches infinity as time approaches infinity can be a failure-rate function. However, in practice most components, or groups of components operating together as subassemblies or systems, tend to have failure-rate curves with a similar kind of appearance. This typical shape for failure-rate curves, called the bathtub curve, is shown in Figure 2.2. The first part of the curve, known as the early failure period, has a decreasing failure rate. During this period, the weak or marginally functional parts are weeded out. In reliability literature, failures in this period are commonly called infant mortality failures. In Chapter 10, we will see that it also makes sense to include the discovery of defects that escape to the field (a characteristic of software reliability) in the front end of this curve. The long, fairly flat portion of the failure-rate curve is called the stable failure period (also known as the intrinsic failure period). Here, failures seem to occur in a random fashion at a uniform or constant rate that does not seem to depend on how long the part has been operating. For example, cosmic radiation is known to cause memory-device failures that do not depend on the actual operating hours of the integrated circuit. Most of the useful life of a component should, and usually does, take place in this region of the curve. The final part of the curve, where the failure rate is increasing, is identified as the wearout failure period. Here, degradation failures occur at an ever-increasing pace. One of the main purposes of reliability testing is to ensure that the onset of the wear-out period occurs far enough out in time as to not be a concern during the useful life of the product. (Another key purpose of reliability testing is to establish the value of the failure rate during the long, stable period.)

FIGURE 2.2 Bathtub curve for failure rates (expected or observed failure rate versus time over the item life: early failure period, stable failure period, and wearout failure period).


It is interesting to note that while every experienced reliability analyst has come across many examples of real failure data exhibiting the shape shown in Figure 2.2, none of the prominent life distributions discussed in the literature, or in this book, have that shape for h(t). These distributions, as we shall see in the next three chapters, only fit one (or at most two) region(s) of the curve reasonably well. In Chapter 10, algorithms for obtaining the entire curve will be developed. EXAMPLE 2.4  ESTIMATING THE CDF, RELIABILITY FUNCTION, AND AFR In Figure 2.3, the lifetimes of ten similar (nonrepairable) components are represented by horizontal lines. The solid circle (∙) on the line shows that a failure occurred at the corresponding time. At 800 hours, one unit (#10) is still surviving, and its (unknown) lifetime is called censored. How would we estimate the CDF F(t) and reliability function R(t) at 500 and at 550 hours? What would be an estimate of the failure rate h(t) or AFR for the interval 500–550 hours? What would be an estimate of the PDF f(t) over this interval?

Solution The various estimates derived from Figure 2.3 are as follows:

F̂(500) = 5/10
R̂(500) = 5/10
F̂(550) = 7/10
R̂(550) = 3/10
ĥ(500 to 550) = (2/5)/(50 hours) = AFR̂(500 to 550)
f̂(500 to 550) = (2/10)/(50 hours)

FIGURE 2.3 Example of component failure data.


This example clearly illustrates the differences among the CDF, PDF, and the hazard rate. To estimate F at a given time, the denominator is the starting number of components and the numerator is the cumulative number of failures. To estimate f over an interval, the same denominator is used, but only the failures in that interval are in the numerator. This estimate approaches the true f in value if the population is large and the interval is small. The failure rate or h(t) estimate uses a different denominator. Instead of the time-zero original population, only the survivors at the start of the interval are used in the calculation. Since five of the original units failed by 500 hours, the denominator is 10 − 5 = 5. The numerator is the number of first-time failures in the interval, or 2. This process actually estimates the AFR over the interval. As with the estimate of f(t), the larger the original population and the smaller the interval, the better this method estimates h(t). Note that all the rate estimates also have to be divided by 50 hours for conversion to failures per unit hour. These estimates may then be multiplied by the appropriate constant, as explained in Section 2.6, if %/K or FIT is desired. EXERCISE 2.8 For the mortality table given in Problem 1.1 at the end of Chapter 1, calculate the hazard (or failure) rate for each ten-year period (in % per year) and plot it.
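A small Python sketch of the Example 2.4 estimates may make the different denominators easier to see. The failure counts below are the ones read from Figure 2.3; everything else follows the formulas just described (the variable names are ours).

# Estimates from Example 2.4: n units on test, failures counted by time.
n = 10                       # components on test at time zero
fails_by_500 = 5             # read from Figure 2.3
fails_by_550 = 7

F_500 = fails_by_500 / n                     # 0.5
R_500 = 1 - F_500                            # 0.5
F_550 = fails_by_550 / n                     # 0.7

interval = 550 - 500                         # 50 hours
new_fails = fails_by_550 - fails_by_500      # 2 first-time failures in the interval

f_hat = new_fails / n / interval                     # PDF estimate: (2/10) per 50 hours
h_hat = new_fails / (n - fails_by_500) / interval    # hazard: survivors in the denominator

print(F_500, R_500, F_550, f_hat, h_hat)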

2.8  Recurrence and Renewal Rates As opposed to nonrepairable components such as integrated circuits, another kind of failure rate can be defined that applies to systems such as network servers that are repaired following a failure. For example, consider a computer that fails and then is restored to operation by replacing the failed part with a new part that may, in turn, later fail. If the distribution of times between successive failures, called the interarrival times, is independent and identically distributed, then we have what is called a renewal process. In other words, the fact that the interarrival times come from the same failure-distribution population implies that the system is restored to “as good as new” following a repair under a renewal process. Such a situation would normally be expected for a single-component system in a stable environment if the replacements are always new parts from the same population as the original component. The mean or expected number of failures per unit time interval at time t is known as the renewal rate r(t). For both renewal and nonrenewal processes, in the reliability literature, one frequently encounters the equivalent expression ROCOF, for rate of occurrence of failures. In this text, we will often use the simpler term recurrence rate or RR. Which failure rate is better to use—the hazard rate that we call the failure rate or the renewal or recurrence rate? The hazard rate (or the AFR) should be used for a population of nonrepairable components. Renewal rates apply to a renewal process. Recurrence rates encompass repairable processes including nonrenewal situations where, for example, instead of replacing parts with brand new parts when they fail, a part similar to the failed part in terms of remaining life is used (e.g., swapping parts in systems). Another possibility is to use a new replacement part that comes from a period or vintage of manufacturing that is different from that of the original part. As a result, the replacement part may have a dissimilar life-distribution curve, and so the renewal rate concept no longer applies.


When failure rates are nearly constant, as on the long, flat portion of the bathtub curve for both repairable and nonrepairable systems, the recurrence rate, the failure rate, and the AFR will all have approximately the same numerical value. For components, however, we will use hazard or failure rates and AFRs throughout this text. For repairable systems, we will use renewal or recurrence rates. Further discussion on properties of repairable systems is provided in Chapters 12 and 13.

2.9  Mean Time to Failure and Residual Lifetime
In Chapter 1, we defined the mean of a lifetime distribution as

µ = ∫_0^∞ t f(t)dt

This quantity is known as the mean time to fail or MTTF. It is the average time of failure for all the failure times in the population, and we can get an estimate of the MTTF by averaging the failure times in a random and uncensored sample taken from the population. In some applications, we may be interested in how much lifetime, on average, a unit has left after having already survived for T0 hours. For this purpose, we can define a mean residual lifetime ResidualMTTF(T0) by the expression

ResidualMTTF(T0) = ∫_T0^∞ t [f(t)/R(T0)] dt

Note that we use the conditional density f(t)/[1 − F(T0)], defined on (T0, ∞), because we know the unit has already survived T0 hours and we want its expected lifetime from that point on. We can also calculate the residual expected lifetime from the equivalent integral

ResidualMTTF(T0) = [∫_T0^∞ R(t)dt]/R(T0)

EXAMPLE 2.5  Residual MTTF(T0) CALCULATION A component is known to have a lifetime that cannot exceed b hours and a CDF given by

F(t) = (t/b)^a, 0 ≤ t ≤ b, a > 0

Find
a. R(t) and f(t)
b. h(t)
c. the MTTF
d. the ResidualMTTF(T0) for 0 < T0 < b
e. R(t), the MTTF, and the ResidualMTTF(T0) for the two special cases a = 1 and a = 2

Solution
R(t) = 1 − (t/b)^a, 0 ≤ t ≤ b, and f(t) = a t^(a−1)/b^a, 0 ≤ t ≤ b; from these results, it follows that h(t) = a t^(a−1)/(b^a − t^a), 0 ≤ t ≤ b. The MTTF is calculated as follows:

MTTF = ∫_0^b [1 − (t/b)^a] dt = [t − (b/(a + 1))(t/b)^(a+1)] evaluated from 0 to b = ab/(a + 1)

The calculation for the ResidualMTTF(T0) has the same integrand, this time divided by R(T0), with new limits from T0 to b. This results in

ResidualMTTF(T0) = {[t − (b/(a + 1))(t/b)^(a+1)]/[1 − (T0/b)^a]} evaluated from T0 to b

After working through the algebra, we get

ResidualMTTF(T0) = [ab^(a+1) − (a + 1)b^a T0 + T0^(a+1)]/[(a + 1)(b^a − T0^a)]

Special Case 1: a = 1: Substituting 1 for a, we obtain (either by substitution above or reworking the integrals with the new integrands)

R(t) = (b − t)/b, h(t) = 1/(b − t), MTTF = b/2, and ResidualMTTF(T0) = (b − T0)/2

Special Case 2: a = 2: Substituting 2 for a, we obtain (either by substitution above or reworking the integrals with the new integrands)

R(t) = 1 − (t/b)^2, h(t) = 2t/(b^2 − t^2), MTTF = 2b/3, and ResidualMTTF(T0) = (2b^3 − 3b^2 T0 + T0^3)/[3(b^2 − T0^2)]
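Closed-form results like these are easy to double-check numerically. The sketch below uses scipy's quad integrator with hypothetical values a = 2, b = 1000 hours, and T0 = 400 hours; the numerical MTTF and ResidualMTTF(T0) should agree with the formulas of Special Case 2.

# Numerical check of Example 2.5 for a = 2 (hypothetical b and T0 values).
from scipy.integrate import quad

a, b, T0 = 2.0, 1000.0, 400.0

R = lambda t: 1.0 - (t / b) ** a

mttf, _ = quad(R, 0.0, b)            # should equal 2b/3
residual, _ = quad(R, T0, b)
residual /= R(T0)                    # ResidualMTTF(T0)

closed_form = (2 * b**3 - 3 * b**2 * T0 + T0**3) / (3 * (b**2 - T0**2))
print(round(mttf, 3), round(residual, 3), round(closed_form, 3))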


Note that the hazard function, for the special case where a = 1, is clearly an increasing function (known in the literature as an increasing hazard function or IFR), and the ResidualMTTF(T0), as one might expect, decreases as T0 increases.

EXERCISE 2.9
For the distribution function F(t) = 1 − [b/(b + t)]^a, 0 ≤ t < ∞, a > 1, find the hazard function, the MTTF, and the ResidualMTTF(T0). Show that when a = 1, both the MTTF and the ResidualMTTF(T0) are undefined (infinite). Show that when a = 2, the ResidualMTTF(T0) is an increasing function of T0, equal to b + T0. Note that in this case, we have the interesting result that the expected life is longer by exactly the same amount of time as the unit was exercised (or burned in).
EXERCISE 2.10
If F(t) = 1 − e^−λt (the exponential distribution), show that the MTTF and the ResidualMTTF(T0) are both 1/λ, independent of the choice of T0. What does this say about "aging" if the exponential life-distribution model applies?

2.10  Types of Data The statistical analysis of reliability data is often more complicated and difficult than the analysis of other experimental data because of the diversity of forms this data may take and the presence of censored or truncated observations. Most other experiments yield straightforward sample data: a random sample of size n gives n numbers that can be used to make inferences about the underlying population. Such complete data are seldom available in reliability evaluations. The kinds of data generally encountered fall into the two categories of either exact failure times or readout (or interval or grouped) data. Non-failed units may have single or multiple right-censoring times. Less commonly, an analyst may encounter left-censored data or truncated data. All these types of data will be described in this section. 2.10.1  Exact Times: Right-Censored Type I Suppose n units are put on test for a fixed planned duration of T hours. Say that r of them fail, and the exact failure times t1 ≤ t2 ≤ t3 ≤ … ≤ tr ≤ T are recorded. At the end of the test, there are n − r survivors (unfailed units). All that is known about the survivors is that their times of failure are beyond T. Such testing is called right-censored Type I. Time-censored data have the advantage of ensuring that timetables are met, since the duration T is set in advance. A fixed test time is valuable in situations where there may be serious impacts if schedules are not rigidly achieved. There is one problem with this kind of testing: the number of failures is not known in advance. This number r is a random quantity. Since, as we shall see, the precision of failure-rate estimates depends on the number of failures r and not on the number of units n on test, a bad choice of sample size or test conditions may result in insufficient information obtained from the test. The test time may fit within the allotted schedule, but the test results may be inadequate.


2.10.2  Exact Times: Right-Censored Type II Again, we place n units on test and record the exact times when failures occur. However, instead of ending the test at a predetermined time, we wait until exactly r failures occur and then stop. Since we specify r in advance, we know exactly how much failure data the test will yield. This count-censored procedure has obvious advantages in terms of guaranteeing adequate data. However, the length of test time is random and open ended. Based on practical scheduling considerations, Type II testing is usually ruled out in favor of Type I. 2.10.3  Readout Time or Interval Data Both of the testing schemes require instruments that can record exact times of failure. Thus, when testing electronic components, continuous in situ monitoring is required. However, this kind of test setup may be impractical or too costly in many cases. The following is a reasonable, commonly used testing procedure: n components are put on test at time zero. T1 hours later, a “readout” takes place whereby all the components are examined and failures are removed. Let us say r1 failures are found. Then, n − r1 components go back on test. At time T2, after T2 − T1 more hours of test, another readout takes place. This time r2 failures are found, and n − r1 − r2 units go back on test. The process is continued with readouts at time T3, T4, and so on. The last readout, at time Tk = T, takes place at the end of the test. Figure 2.4 illustrates the sequence of events. This form of data is called readout or interval or grouped data. The readout times are predetermined, as is the end of test time. The number of failures occurring in an interval is not known until the readout takes place. The exact times of failure are never known. Readout data experiments have the same problem as those with exact failure time data: the experiment may end before a sufficient number of failures takes place. In addition, precision is lost by not having exact times of failure recorded. Even if there are many failures, the data may be deficient if these failures occur in too few intervals. Despite the above drawbacks and the difficulties in analyzing readout data, it is probably the most common type of reliability data. In situ monitoring usually requires expensive test equipment that is often neither available nor cost justified. The reliability analyst must learn to make the best possible use of readout data experiments. This goal involves careful planning of the test (i.e., sample sizes and times of readout) as well as the use of good analysis methods. 2.10.4  Multicensored Data In the most general case, every unit on a test may have a specified interval during which it is known to have failed, or a censoring time past which it is known to have survived (often called a runtime). These intervals and censoring times might be different for each

FIGURE 2.4 Readout data.


unit. An exact time of failure would be a “degenerate” or very small interval. This kind of data, with varying censoring times, is called multicensored. In the laboratory, it is rare that every unit has a different censoring time, but less complicated examples of multicensored data are common. In the field, many types of multicensored data frequently occur. To illustrate, consider a datacenter in which a number of similar network servers are operational. These servers would most likely have been installed at different times, and consequently each would have different operating hours on a given reference date, resulting in different censoring times. We will provide further discussion of this topic for repairable systems in Chapters 12 and 13. 2.10.5  Left-Censored Data For a sample of n units, exact times of failure may occasionally be missing for the earliest failures. All that is known is that r failures took place before time T, and from then on, failure or censoring times were recorded on the remaining n – r units. This situation is known as left-censored failure data. Left-censored data arise for failures occurring before the first readout for interval data, since we know only that the failures happened before the first test inspection. In another instance of left censoring, we consider a system that has subsystems such that a failure in one subsystem would not necessarily cause the overall system to fail. Such might be the case for the failure of a radio or one of the headlights in a car. When the subsystem is eventually repaired at some later time, we may have only an upper bound but not the exact time when the failure occurred. 2.10.6  Truncated Data Truncated data may look like either right- or left-censored data, with known exact or interval data values (times of failure or measured values) and a time T before which (left truncated) or after which (right truncated) no exact values are known. However, for censored sample data, we know how many unseen data values fall either before or after T. In the case of truncated data, because of population conditions or restrictions, we do not know how many values belonging to the truncated region are missing from our sample. We have a sample from a population that is missing an unknown number of data values that would have fallen in the unobserved range. For example, suppose the product is subjected to a burn-in of 6 hours prior to release. Information is not available on the number of failures occurring during the burn-in period, resulting in left-truncated data on the units surviving burn-in. Left truncation is also common in medical studies. For example, the objective might be to investigate the mortality rate from the onset of a disease, but observations are available on patients only from the date of entry into the program. Thus, the number of possible patients included in the time between the onset of the disease and the beginning of tracking is unknown, resulting in left-truncated data. Right-truncated data can occur, for example, when observations above a certain value T are neglected. Consider a random sample of n units. It is suspected that a certain percentage of the units are defective. A stress of 100 hours is run, and 8% of the units fail for the suspected defect. However, because we do not know the number of remaining defective units in the sample, we consider the data right truncated at 100 hours. 
The important difference between censored and truncated data is that truncated items (typically an unknown number) are excluded from the sample under observation. In contrast, censored observations are included in the study as part of the sample.


EXAMPLE 2.6  MULTICENSORED EXPERIMENTAL DATA Capacitors are to be tested on fixtures mounted in ovens. While the test is on, the parts are subjected to a fixed high voltage and high temperature. At the end of each day, they are removed from the ovens and tested to determine which ones have failed. Then, the unfailed units will be put back on test. The plan is to continue in this way for 1000 hours. This test plan would normally yield standard readout data, but several unexpected things might happen to change this. Two possibilities illustrate this point. Assume the test starts with 200 capacitors in four ovens, each containing 50 units. Halfway through the test, one of the ovens malfunctions, causing all further data on its 50 parts to be invalid. This event results in multicensored data with the capacitors in the bad oven “taken off test” at the time of malfunction. The other ovens and parts continue for the full test time. Instead of a malfunction, the same situation might occur if priorities change, and the test engineer must give up one of his ovens before his or her capacitor experiment is completed. These examples show that a mild form of multicensoring may occur even when straightforward readout data is expected. The next example (Example 2.7) shows a case where heavily multicensored data arise as a natural consequence of the data collection scheme. EXAMPLE 2.7  MULTICENSORED FIELD FAILURE DATA Data obtained from components operating in customer environments often has a characteristic form of multicensoring. The data, in this case, consists of failure information on several groups of components. Each group is under observation for a possibly different interval of time, and data is read out once at the end of that interval. A concrete example will make this clear. Assume a machine uses 100 components of a type that is of interest. We are able to examine the field history of ten machines. Each of these machines has operated a different length of time, based on the date of customer installation. For each machine, only the total number of component failures, up to the date of the investigation, is available. All failures are assumed to come from the original 100 components and no repairs are considered necessary, despite failures. (Fault-isolation circuitry in computer systems is an example where a machine may continue to operate despite component failures.) Each machine provides data on a group of 100 components that were “on test” for just one readout interval. This interval is the amount of time the machine has operated and varies from machine to machine. There are also ten different censoring times: one for each machine’s group of unfailed units. Methods for analyzing censored and multicensored data will be covered in the next 5 chapters for nonrepairable systems or units. EXAMPLE 2.8  LEFT-TRUNCATED DATA An automated microscope scanner is set up to visually inspect wafers for defects. The scanner spots and lists all defects in its field of vision that are larger than y microns in size. After examining n wafers, an attached computer records and lists every defect observed, along with the defect size. However, all defects on the wafers that are less than y microns in size are neither observed nor recorded. This procedure results in a left-truncated set of sample data, missing all defects in the population below y microns in size, with no count of how many are missing. 
If the missing portion of the defect population has importance for reliability modeling, a left-truncated analysis should be done on the recorded sample data. EXAMPLE 2.9  LEFT- AND RIGHT-CENSORED DATA In a study by Turnbull and Weiss (1978) on the use of marijuana, 191 high school students were asked, “When did you first use marijuana?” There were several possible answers. Some provided the exact age of first usage, which represented uncensored observations. For those who had never


used marijuana, the data point was right censored. For anyone who had used marijuana but could not recall the age of first try, the observation was left censored at the student’s current age. This example illustrates what is called doubly censored data.

2.11  Failure Mode Separation Multicensored data can also come about when test components fail for more than one reason. For example, a corrosion failure mode and a metal migration failure mode might both take place when testing a semiconductor chip. Each mode of failure might follow a completely different life distribution, and it is desirable to analyze the data for each mode separately. If the failure modes act independently of one another, we can analyze the corrosion data alone by treating the times when the units failed due to migration as random censoring times, which is called noninformative censoring. After all, if migration failure had not occurred, these units would have eventually failed due to corrosion. Instead, they were “taken off test” at the time of their migration failure. The analysis of migration failures is done in a similar fashion, by treating the times of corrosion failures as random censoring times. This approach can be extended to more than two independent modes of failure. This topic is treated further in the section on analysis of competing risk data in Chapter 10, Section 10.2.

2.12  Summary
This chapter defined the reliability function R(t) = 1 − F(t) and the (instantaneous) failure rate h(t) = f(t)/R(t). This failure rate applies at the instant of time t for the survivors of the starting population still in operation at that time. An AFR over the interval (t1, t2) can also be defined by integrating the failure rate over this interval and dividing by t2 − t1. This expression equals [ln R(t1) − ln R(t2)]/(t2 − t1). A useful identity that shows how the CDF can be reconstructed from the failure rate is

F(t) = 1 − e^−H(t) = 1 − e^(−∫_0^t h(y)dy)

A plot of failure rate versus time for most components or systems yields a curve with the so-called bathtub shape. The front decreasing portion shows early life fallout. Then, a long, fairly flat region occurs. Finally, at some point in time, wear-out failures due to degradation mechanisms start to dominate, and the failure rate rises. One common aspect of reliability data causes analysis difficulties. This feature is the censoring that takes place because, typically, not all units on test fail before the test ends. Reliability data may consist of exact times of failure up to the end of a fixed-length test. This type of data is called Type I censoring (also referred to as time censoring). Type II censoring (also called failure censoring) refers to a test that lasts until a prespecified number of failures occur. While Type II censoring may lead to better data, it is less popular because of the open-ended nature of the test duration.


Perhaps the most typical kind of data consists of numbers of failures known only to have occurred between test readouts. This type is called readout or interval or grouped data. If all the units do not have the same end of test times, the data is called multicensored and can be the most difficult to analyze. Multicensored data may also come about when some units on test are damaged or removed prematurely or fail due to more than one failure mode. Field (or customer operation) data can also come in a multicensored form. Left-censored data results when some fails occur in the readout interval (0, T), and exact or interval times of failure are known after T. Truncated data occurs when all observations that have a particular range of values are missing from the sample, and the numbers that are missing are typically not known.

Problems







2.1 Show that an alternate, equivalent definition of the hazard rate is the negative derivative of the natural logarithm of the reliability function.
2.2 Use the data in Figure 2.3 to estimate the CDF and the reliability function at 600 and 700 hours. Estimate the failure rate for the interval 600–700 hours. Also estimate the renewal rate and f(t) for this period.
2.3 Fifty units are tested for 1000 hours. Thirty exact times of failure are recorded. After failure analysis has been completed, it is known that there were three different failure modes. Mode A had 13 failures, mode B had 9 failures, and mode C had 8 failures. Describe how you would summarize the data first to analyze the failure modes A, then B, and then C. What assumption are you making?
2.4 From the mortality table given in Problem 1.1, generate a new mortality table for 50-year-olds. (Note that the fraction living at age 50 should start with 1.000.) Calculate and plot the hazard function for 50-year-olds and compare it with that for 10-year-olds. What general conclusions do you draw?
2.5 1000 units are stressed on test for 1000 hours. Readouts to check for failures take place at 24, 48, 96, 168, 500, and 1000 hours. Cumulative failures noted at these readouts are 19, 28, 32, 34, 35, and again 35, respectively. The remaining 965 test units survive the 1000-hour test. Estimate the CDF and the reliability function at the end of each readout interval. Estimate the hazard rate (or the AFR) during each interval.

3
Exponential Distribution
The exponential distribution is one of the most common and useful life distributions. In this chapter, we discuss the properties and areas of application of the exponential and then we look at how to estimate exponential failure rates from data. A method for calculating upper and lower confidence bounds is included. Detailed examples illustrate how to solve many important problems of experimental test planning.

3.1  Exponential Distribution Basics
The probability density function (PDF) for the exponential

f(t) = λe^−λt

and the cumulative distribution function (CDF)

F(t) = 1 − e^−λt

were introduced in Chapter 1. Figures 1.3 and 1.4 gave plots of these functions. In both of these equations, λ is the single unknown parameter that defines the exponential distribution. If λ is known, values of F(t) can be calculated easily for any t. Only an inexpensive calculator or a table of natural logarithms is needed (the exponential function is the inverse function of the natural logarithm, or e^ln(x) = x). Alternatively, the spreadsheet function EXPONDIST(t, λ, 0 or 1) may be used, with the desired t and λ values inserted. The third argument (0 or 1) determines whether the PDF or the CDF is provided, respectively. In Excel, for example, EXPONDIST(0.5,1,0) returns 0.606531 and EXPONDIST(0.5,1,1) gives 0.393469. In OpenOffice Calc, the same expression applies with commas replaced by semicolons. Since R(t) = 1 − F(t) = e^−λt, the failure rate function (using the definition of h(t) given in Chapter 2) for the exponential distribution is

h(t) = f(t)/R(t) = λe^−λt/e^−λt = λ

This result shows that the exponential failure rate function reduces to the value λ for all times. This is a characteristic property of the exponential distribution. The only distribution with a constant failure rate function is the exponential (see Example 3.2). Figure 3.1 shows how this failure rate looks when plotted against time. The units for λ are failures per unit time, which are consistent with whatever units time is measured in. Thus, if time is in hours, λ is in failures per hour.


FIGURE 3.1 The exponential distribution failure rate h(t).

TABLE 3.1 Equivalent Failure Rates in Different Units

Failures/Hour    %/K      FIT
0.00001          1.0      10,000
0.000001         0.1      1000
0.0000001        0.01     100
0.00000001       0.001    10
0.000000001      0.0001   1

Failures per hour × 10^5 = %/K. Failures per hour × 10^9 = FIT. %/K × 10^4 = FIT.

If time is in thousand-hour units (abbreviated as K), then λ is in failures per K. Often, however, failure rates are expressed in percent per K; this rate has to be converted to failures per unit time before making any calculations using the exponential formula. The same is true if failure rates are measured in FITs, which are failures per 10^9 device hours, as explained in Section 2.6 on units. Table 3.1 shows some examples of how to convert between these systems of units.

EXERCISE 3.1
The average failure rate, or AFR, between time t1 and time t2 was defined in Chapter 2 to be [ln R(t1) − ln R(t2)]/(t2 − t1). Show that for the exponential, this expression reduces to the constant λ.

EXAMPLE 3.1  EXPONENTIAL PROBABILITIES A certain type of transistor is known to have a constant failure rate with λ = 0.04%/K. What is the probability one of these transistors fails before 15,000 hours of use? How long do we have to wait to expect 1% failures?

Solution First, we convert λ to failures per hour units by multiplying by 10^−5 to get λ = 0.0000004. The probability of failure by 15,000 hours is F(15,000) or

1 − e^(−0.0000004 × 15,000) = 0.006 = 0.6%

We could also use the OpenOffice Calc spreadsheet function EXPONDIST to get the same result by typing EXPONDIST(15000;0.0000004;1).
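The same calculation is equally short in a general-purpose language. A Python sketch (ours, not from the text) for the first part of Example 3.1:

# Python equivalent of the spreadsheet calculation in Example 3.1.
import math

lam = 0.04e-5                 # 0.04%/K converted to failures per hour
t = 15_000

F = 1 - math.exp(-lam * t)    # probability of failure by 15,000 hours
print(round(F, 3))            # about 0.006, i.e., 0.6%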


We find the time corresponding to any F(t) by inverting the formula for the CDF, solving for t in terms of the proportion F(t). This inversion gives

t=

− ln(1− F (t )) λ

Substituting 0.01 for F(t) and 0.0000004 for λ gives t = 25,126 hours. EXAMPLE 3.2  CONSTANT FAILURE RATE Show that a constant failure rate implies an exponential distribution model.

Solution The basic identity relating failure rates to CDFs was derived in Chapter 2. The general formula is

F(t) = 1 − e^(−∫_0^t h(y)dy)

If h(y) = λ, a constant, the integral becomes

∫_0^t λ dy = λt

Then, we have F(t) = 1 − e^−λt, or the exponential CDF.

The theoretical model for the shape we expect exponential data to resemble, when plotted in histogram form, is that of the PDF f(t) = λe^−λt. This formula was shown previously in Chapter 1 as Figure 1.3. Sample data that are suspected of coming from an exponential distribution can be plotted in histogram form, as described in Chapter 1, and compared in shape to this ideal form. An alternative way of looking at data to see if they appear exponential will be discussed in Chapter 6 (graphical plotting).

EXAMPLE 3.3  EXPONENTIAL DATA In high-density integrated circuits (ICs), soft errors, as opposed to hard failures, are failures in which the device remains functional but the memory information becomes corrupted. Soft errors, which are induced by background radiation in the form of cosmic rays, are well-known failure mechanisms in dynamic random access memory devices (DRAMS; see, for example, Ziegler 1996). The IC company Reliable Memory makes memory chips that they believe are less susceptible to soft errors during normal use. They want to guarantee their product for 2 years of operation. The quality department is given 3 months to determine what such a guarantee is likely to cost. Fortunately, the engineer has devised what is called an accelerated test plan as a verified way of stressing DRAMS (using higher-than-normal radiation dosage levels) that can simulate a month of typical field use in 1 hour of laboratory testing. He is able to take a random sample of 100 memory chips and test them all until failure in less than 3 months. He does this experiment and records the equivalent typical user month of failure for each chip. The sample data are given in Table 3.2. The test engineer wants to use an exponential distribution for chip failures based on his past experience.


Solution The first step in analyzing the data is to decide whether an exponential is a reasonable model. Using the techniques described in Chapter 1, we will construct a histogram for the data. First, however, we note how much larger our intervals will have to be in order to include the last point because of its distance from the other data points. So we ignore temporarily the last point and use the range 547 − 1 = 546 instead of the actual range of the data. Dividing this range by 10 gives 54.6, or 55, for a cell width. For ease of calculation, we will start the first interval at 0 and go up to, but not including, 55. The second interval is from 55 to, but not including, 110, and so on. Table 3.3 contains the frequency table for these intervals. By comparing the shape of the histogram in Figure 3.2, produced using Minitab, to the exponential PDF shape in Figure 1.3, we see that an exponential model is a reasonable choice for the data. In Sections 3.6, 3.8, and 3.9, we will show how to choose an estimate of λ from the data and test statistically whether an exponential model with that λ is an acceptable fit. This example also introduced a very useful concept: accelerated testing. How does one verify that testing 1 hour at a given condition is equivalent to 10, 100, or some other number of hours at a typical use condition? What is the mathematical basis behind "acceleration" and what data analysis tools are needed? Chapter 7 will deal with this important topic.

TABLE 3.2 Sample Data of Equivalent Month of Memory Chip Failure
1    11   20   40   64   82    105   164   214   365
2    13   21   41   65   86    109   167   236   369
2    15   21   46   65   87    109   170   238   389
3    16   24   47   65   89    120   178   240   404
4    17   27   48   67   94    123   181   265   427
5    17   29   52   76   96    141   191   304   435
7    18   30   54   76   100   150   193   317   500
8    18   37   54   79   101   156   206   328   522
9    18   40   55   80   102   156   211   355   547
10   20   40   55   80   104   161   212   363   889

TABLE 3.3 Frequency Table of Memory Chip Data
Cell Boundaries    Number in Cell
0–55               38
55–110             25
110–165            8
165–220            10
220–275            4
275–330            3
330–385            4
385–440            4
440–495            0
495–550            3
…                  …
880–935            1
Total count        100



FIGURE 3.2 Histogram of memory chip failure data.

3.2  The Mean Time to Fail for the Exponential

The mean time to fail (MTTF) for a life distribution, as defined in Chapter 1 and discussed in Chapter 2, is the population average or mean time to fail. In other words, a brand new unit has this expected lifetime until it fails. For the exponential, the MTTF is

\text{MTTF} = \int_0^\infty R(t)\,dt = \int_0^\infty e^{-\lambda t}\,dt = \frac{1}{\lambda}

We interpret this result as follows: the MTTF for a population with a constant failure rate λ is the reciprocal of that failure rate or 1/λ. Although 1/λ is the average time of failure, it is not the same as the time when half the population would have failed. This median time to failure, or T50, was described in general in Chapter 1. For the entire population, the median is defined to be the point where the CDF function first reaches the value 0.5. For the exponential, we have

F(T_{50}) = 0.5 = 1 - e^{-\lambda T_{50}}

Taking natural logarithms and solving for T50 yields

T_{50} = \frac{\ln 2}{\lambda} = \frac{0.693}{\lambda}


Note that the T50 for this distribution is less than the MTTF since the numerator is only 0.693 instead of 1. In fact, when time has reached the MTTF, we have

F(\text{MTTF}) = 1 - e^{-\lambda/\lambda} = 1 - e^{-1} \approx 0.632

This result shows that approximately 63.2% of an exponential population with failure rate λ has failed by the time the MTTF 1/λ is reached.

EXAMPLE 3.4  MEAN TIME TO FAIL

A company manufactures components that are known to fail according to an exponential distribution with failure rate λ = 0.15%/K. What is the probability any one component will survive 20,000 hours of use? What is the probability it fails in the next 35,000 hours? What is the MTTF? At what point in time will 10% of these components be expected to fail? 50%? When will 63.2% have failed?

Solution
First, we convert λ to failures per unit of time from %/K using the 10⁻⁵ conversion factor (if λ is given in FITs, the factor is 10⁻⁹). Then, λ = 0.15 × 10⁻⁵ or 0.0000015. The probability of surviving 20,000 hours is R(20,000) = e^{−0.03} or 0.97. The probability of failing in the next 35,000 hours after surviving 20,000 hours is a conditional probability with value [F(55,000) − F(20,000)]/R(20,000) or [0.079 − 0.030]/0.97 = .051. The MTTF is 1/λ or 1/0.0000015 = 666,667 hours. We find out when 10% will fail (known as the 10th percentile) by solving the inverse equation for t when F(t) = 0.1.

t = \frac{-\ln(1 - F(t))}{\lambda} = \frac{-\ln(1 - 0.1)}{0.0000015} = 70{,}240 \text{ hours}

Finally, the time when 50% of the population has failed is the median or T50 point given by 0.693/λ = 462,000 hours, and the 63.2% point is the MTTF or 666,667 hours.

In the last example, we calculated the probability of failing in 35,000 hours after surviving 20,000 hours. This turned out to be .051. But the probability of a new component failing in its first 35,000 hours, or F(35,000), also equals .051. Previous stress time does not seem to make any difference. This property of the exponential distribution is discussed in Section 3.3.

EXERCISE 3.2
Consider the exponential distribution with MTTF = 50,000 hours. What is the failure rate in %/K hour? What is the time to 10% failure? To 50% failure?
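The calculations in Example 3.4 are easy to reproduce in a few lines of code, and the same template applies to exercises such as Exercise 3.2. The following is a minimal sketch (Python; the function names F and R are our own shorthand for the CDF and reliability function):

import math

lam = 0.15e-5          # failure rate: 0.15 %/K expressed in failures per hour

def F(t):              # exponential CDF
    return 1 - math.exp(-lam * t)

def R(t):              # reliability (survival) function
    return math.exp(-lam * t)

print(R(20_000))                               # about 0.97: survive 20,000 hours
print((F(55_000) - F(20_000)) / R(20_000))     # about 0.051: fail in the next 35,000 hours
print(1 / lam)                                 # MTTF of about 666,667 hours
print(-math.log(1 - 0.10) / lam)               # 10th percentile, about 70,240 hours
print(math.log(2) / lam)                       # median T50, about 462,000 hours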

3.3  The Exponential Lack of Memory Property

The constant failure rate is one of the characteristic features of the exponential. Closely related is another key property called the exponential "lack of memory." A component following an exponential life distribution does not "remember" how long it has been operating. The probability it fails in the next hour of operation is the same if it is new, 1 month old, or several years old. There is no aging, wear out, or degradation with time or use (at least in a probabilistic sense). Failure is a chance happening—always at the same constant rate and unrelated to accumulated power-on hours.


The equation that describes this property says that the conditional probability of failure in some interval of time of length h, given survival up to the start of that interval, is the same as the probability of a new unit starting at time zero failing in the first interval of length h:

P(fail in next h|survive t) = P( new unit fails in h)

In terms of the CDF, this relation becomes

\frac{F(t + h) - F(t)}{1 - F(t)} = F(h)

This equation holds if F(t) is the exponential CDF, as the reader can easily verify. It can also be shown that only the exponential has this property for all t and h (see Feller 1968, page 459).

The implications of this concept from a testing point of view are highly significant. We gain equivalent information from testing 10 units for 20,000 hours or from testing 1,000 units for 200 hours (or even 1 unit for 200,000 hours). All device hours are equivalent. If a unit fails on test, we have the option of repairing or replacing it and continuing the test without worrying about the fact that some of the test units have a different age from that of other units. If engineering judgment says that the above testing equivalences seem wrong in a particular case, then we are really saying that we do not believe that the exponential distribution applies. However, when the exponential can be used as a reasonable model for the data and the type of item on test, these advantages in test design are valid.

Another consequence of the lack of memory property is that the renewal rate (defined in Chapter 2, Section 2.8) for repairable systems and the failure rate (as well as the AFR) for nonrepairable components are all numerically equal with value λ. This fact takes away concern over the issue of which rate applies in a particular situation (discussed in Chapter 2). We can even define an expected time between failures for a repairable exponentially distributed unit. This mean time between failures (MTBF) is again 1/λ. Also, as one would expect, the Residual MTTF(T0), defined in Chapter 2 as the expected time to fail or MTTF of a component that has already lived T0 hours, is again 1/λ.

EXERCISE 3.3
Given that a memory chip lifetime for soft errors follows the exponential distribution with MTTF = 36 months, find the probability of failure in 10 months. Next, find the probability of failure in the following 10 months, given survival to 10 months.
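The lack of memory identity is easy to verify numerically. In the short sketch below (Python, with an arbitrary illustrative failure rate, not a value from the text), the conditional probability of failing in the next h hours is the same whether the unit is new or has already survived a long time:

import math

lam = 1.0e-4    # illustrative exponential failure rate per hour

def F(t):
    return 1 - math.exp(-lam * t)

h = 100.0                          # length of the "next" interval
for t in (0.0, 500.0, 10_000.0):   # ages at the start of the interval
    conditional = (F(t + h) - F(t)) / (1 - F(t))
    print(t, conditional, F(h))    # the conditional probability equals F(h) at every age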

3.4  Areas of Application for the Exponential

If we feel that a unit under test has no significant wear-out mechanisms, at least for its intended application life, and either we do not expect many early defect failures or we intend to separate these out and treat them separately, then the exponential is a good initial choice for a life distribution model.

Note the words "no significant wear-out mechanisms, at least for its intended application life." Even though we know that, in every imaginable real-world case, some kind of wear out eventually takes place, we can ignore this consideration if we feel it is not a practical concern. Having a nearly constant failure rate over the region of time we are interested in (and confining our testing time to this region or its equivalent under test acceleration) is all we need to use the exponential.

Another application of the exponential is in modeling the long flat portion of the bathtub curve. Since actual failure data on many systems, subassemblies, or even individual components have a nearly constant failure rate for most of the product life, exponential methods of data analysis can be used successfully. In fact, if we specify product performance over an interval of time using an AFR, and we are not particularly concerned with how the failures spread out over that interval, then we can use an exponential assumption and the exponential confidence bound factors (to be described later in Section 3.10). This feature extends the usefulness of the exponential to many cases where it would not be the correct theoretical model. Just as any smooth curve can be approximated piecewise by straight lines to any degree of accuracy required, we can consider a changing failure rate curve to be constructed of many piecewise constant exponential portions. Then, we can analyze data from within any one piece, or interval, as if it were exponential.

EXAMPLE 3.5  PIECEWISE EXPONENTIAL APPROXIMATION


Figure 3.3 illustrates how a typical bathtub curve showing the actual failure rate for a given product might be piecewise approximated by constant failure rate (exponential distribution) segments. Each constant failure rate value equals the AFR over that section of the bathtub curve. The time points might be months or quarters in the field. The AFR estimate would be the number of failures occurring from units beginning that age range divided by the total hours of operation within that age range of all shipped units (as described in Section 3.6). Later in Section 3.12, we will see that when the exponential assumption applies, preplanning experiments is a very straightforward process. For example, sample sizes, confidence levels, and precision can be decided in advance. It will be much more difficult—or even impossible—to do similar exact planning for other life distribution models. For this reason, the exponential is also useful as a trial model in the experimental planning stage, even if we do not expect it to actually apply later on.

FIGURE 3.3 Piecewise approximation of actual failure rate (failure rate h(t) versus time).


One must be careful, however, when setting up an experiment designed to accelerate a known wear-out mechanism. If the purpose is to determine whether wear out will, indeed, start well beyond useful product life, then we are clearly interested in the nonflat rising portion of the bathtub curve. Exponential analysis methods would not apply, and preplanning should be based on the actual life distribution model, if at all possible.

3.5  Exponential Models with Duty Cycles and Failure on Demand

In normal use, some components or systems (e.g., a lightbulb or air conditioner) may experience duty cycles during which operation in different states occurs. For example, there might be an idle state and one full power-on state during a typical duty cycle. A particular state may last for part of an hour or day or longer. For simplicity, assume that in every cycle, the component spends an average of toff hours in an idle state and ton hours in a fully on state. Also, assume that the component follows an exponential failure model with failure rate λoff when idle and failure rate λon when fully on. What model best matches the overall failure rate for this component?

Because of the exponential lack of memory property, every time the component begins a new cycle, it has the same chance of failing during that cycle as it did during any earlier cycles. The probability of survival for the duration of any cycle is given by the product of the independent probabilities of surviving the idle state and the on state. This result is given by

P\{\text{survive during a cycle}\} = R_{\text{off}}(t_{\text{off}})\,R_{\text{on}}(t_{\text{on}}) = e^{-\lambda_{\text{off}} t_{\text{off}}}\,e^{-\lambda_{\text{on}} t_{\text{on}}}
= e^{-\left[\frac{\lambda_{\text{off}} t_{\text{off}} + \lambda_{\text{on}} t_{\text{on}}}{t_{\text{off}} + t_{\text{on}}}\right](t_{\text{off}} + t_{\text{on}})}
= e^{-\lambda_{\text{effective}}(t_{\text{off}} + t_{\text{on}})}

From this, we see that an effective failure rate of

\lambda_{\text{effective}} = \frac{\lambda_{\text{off}} t_{\text{off}} + \lambda_{\text{on}} t_{\text{on}}}{t_{\text{off}} + t_{\text{on}}}

will produce the same number of failures per cycle. Obviously, we could include additional states of operation within each cycle as long as they are also characterized by another exponential failure rate for a typical average time of operation during every cycle. We can make this situation even more general by including the possibility that the component fails when switching from an idle state to an on state. If there is a probability p of failure (due to current surges, perhaps) when the component turns “on,” then the probability of surviving a cycle is

(1 - p)\,e^{-\lambda_{\text{off}} t_{\text{off}}}\,e^{-\lambda_{\text{on}} t_{\text{on}}}

and the effective (equivalent exponential) failure rate can easily be shown to be

\lambda_{\text{effective}} = \frac{-\ln(1 - p) + \lambda_{\text{off}} t_{\text{off}} + \lambda_{\text{on}} t_{\text{on}}}{t_{\text{off}} + t_{\text{on}}}


Since −ln(1 − p) is approximately equal to p (for small p), this effective failure rate is often given as

\lambda_{\text{effective}} \approx \frac{p + \lambda_{\text{off}} t_{\text{off}} + \lambda_{\text{on}} t_{\text{on}}}{t_{\text{off}} + t_{\text{on}}}

The possibility of failure occurring when the component is turned on is known as failure on demand. For a three-state example, assume that every system cycle includes toff hour with exponential failure rate λoff, tidle hour with exponential failure rate λidle, and ton hour with exponential failure rate λon. During the switch from off to idle, there is a probability p1 of failure and during the switch from idle to fully on, there is a probability p2 of failure. Then, the approximate equivalent effective single exponential failure rate per cycle would be

\lambda_{\text{effective}} \approx \frac{p_1 + p_2 + \lambda_{\text{off}} t_{\text{off}} + \lambda_{\text{idle}} t_{\text{idle}} + \lambda_{\text{on}} t_{\text{on}}}{t_{\text{off}} + t_{\text{idle}} + t_{\text{on}}}
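The effective failure rate formulas are straightforward to evaluate in code. Here is a minimal sketch (Python; the state durations, failure rates, and switching probabilities are hypothetical illustrative values, not taken from any example or exercise in this chapter) for the three-state case with failure on demand:

import math

# Illustrative three-state duty cycle (hours per cycle) -- hypothetical values
t_off, t_idle, t_on = 12.0, 8.0, 4.0
lam_off, lam_idle, lam_on = 0.0, 2.0e-5, 8.0e-5   # exponential failure rates in each state
p1, p2 = 1.0e-4, 4.0e-4                           # failure-on-demand probabilities at each switch

cycle_hours = t_off + t_idle + t_on

# Exact equivalent rate uses -ln(1 - p); the approximation replaces -ln(1 - p) with p
lam_exact = (-math.log(1 - p1) - math.log(1 - p2)
             + lam_off * t_off + lam_idle * t_idle + lam_on * t_on) / cycle_hours
lam_approx = (p1 + p2
              + lam_off * t_off + lam_idle * t_idle + lam_on * t_on) / cycle_hours

print(lam_exact, lam_approx)                 # nearly identical when p1 and p2 are small
print(math.exp(-lam_exact * cycle_hours))    # probability of surviving one full cycle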

EXERCISE 3.4
Every 24 hours, a component spends an average of 8 hours in a fully off state, 10 hours in an idle state, and 6 hours at full power. While in the off, idle, and full-power states, the component follows an exponential life distribution with failure rates per hour of 0, 0.00001, and 0.00005, respectively. When switching from off to idle (once each day), the component has a 0.00015 chance of failure. When switching to full power (once each day), the component has a 0.0005 chance of failure. What is the effective single exponential failure rate that leads to the same probabilities of failure, day by day, for this component? What is the probability the component survives 90 days?

3.6  Estimation of the Exponential Failure Rate λ

When data come from an exponential distribution, there is only one parameter, λ, to estimate. The best estimate for complete or censored samples is

\hat{\lambda} = \frac{\text{number of failures}}{\text{total unit test hours}}

The denominator is the sum of all the operation hours of every unit on test, including both failures and those that have completed the test without failing. For a complete sample (everything fails and exact times of failure are recorded), this expression reduces to the reciprocal of the sample mean. Thus, we have λ̂ = 1/(sample mean time to failure), just as we had λ = 1/MTTF. For censored Type I data (fixed test time), with r failures out of n on test

\hat{\lambda} = \frac{r}{\sum_{i=1}^{r} t_i + (n - r)T}


T is the prefixed end of test time and t1 , t2 , t3 , … , tr are the exact failure times of the r units that fail before the test ends. If the test is censored Type II (ends at rth failure time tr ), the same rule yields

\hat{\lambda} = \frac{r}{\sum_{i=1}^{r} t_i + (n - r)t_r}

It should be noted that the estimation rule described here for exact times yields maximum likelihood estimates (see definition in Chapter 4 and Example 4.4). If new units are put on test to replace failed units, or to increase the sample size part way through the testing, then we have multicensored data. Applying the general form for λ, the denominator is the sum of each test unit's time on test.

When we have readout data, we can no longer exactly calculate the denominator in order to estimate λ. In this case, the graphical methods described in Chapter 6 can be used, which also apply, as an alternate approach, when exact times are available. More precise techniques, based on the method of maximum likelihood estimation, will be described in Chapter 4. We can also apply the simple procedure described in Example 3.6 to readout data, often with little loss of accuracy. For example, if many units are on test with only a few failures, the error in λ̂ from assuming all failures occur in the middle of the readout interval will be negligible. With this assumption, the simple confidence bound procedures described in Section 3.10 can be used.

EXAMPLE 3.6  FAILURE RATE AND MTTF

Returning to the memory chip data of Example 3.3 and Table 3.2, the estimate of λ is λ̂ = 100/13,563 = 0.00737 chip fails per month. The MTTF estimate 1/λ̂ is 135.63 months. On the other hand, if we had used the summarized readout data version of chip failure months given in Table 3.3, we would have calculated λ̂ = 100/13,585 = 0.00736 chip fails per month, leading to an MTTF estimate of 135.85 months. For this failure rate calculation, the denominator is obtained by multiplying the 38 failures in the first interval by the middle of the interval, or 27.5, and adding 25 × 82.5 for the second interval, and so on. For the one fail time after 550 months, we pretend that our 55-month intervals go on to 935 months, where the last fail is recorded. Placing this fail in the center of that last interval (880–935) at 907.5 months gives the total test time of 13,585 months. This example shows that using the failures divided by total test time formula on readout data is not likely to cause much error.

EXERCISE 3.5
Perform the calculation described above using the interval data in Table 3.3 to estimate λ.

One immediate application we can make of this λ estimate is to use it to calculate values of the PDF to compare to our histogram in Figure 3.2. In other words, the histogram should have the same shape as f(t), which is f(t) = 0.00736e^{−0.00736t}. Before graphing f(t) on the same chart with the histogram, however, we have to adjust scales. The total area under the f(t) curve is always 1. Because the intervals have width 55 and the height units are in percent, the histogram plot has an area of 100 × 55 or 5500 (if we put one more box of height 1 between 880 and 935). In Figure 3.4, we plot 5500 × f(t) along with the memory chip data histogram. This graph gives us a direct shape comparison to show that an exponential model applies. In Section 3.8, we will test this fit with the chi-square goodness-of-fit test.

FIGURE 3.4 Memory chip data histogram compared to f(t) shape.

EXERCISE 3.6
Using the data in Table 3.2, construct a histogram based on cell width 35. Estimate λ from the interval data. Fit f(t) to the histogram and compare to Figure 3.4.
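For exact failure times, these estimators reduce to dividing the failure count by the accumulated unit test hours. The sketch below (Python; the failure times and test plan are hypothetical, not data from the text) implements the Type I and Type II formulas:

def lambda_hat_type1(failure_times, n, T):
    """Type I (time-censored) estimate: r failures out of n units, test ends at time T."""
    r = len(failure_times)
    total_hours = sum(failure_times) + (n - r) * T
    return r / total_hours

def lambda_hat_type2(failure_times, n):
    """Type II (count-censored) estimate: test ends at the r-th failure time."""
    r = len(failure_times)
    t_r = max(failure_times)
    total_hours = sum(failure_times) + (n - r) * t_r
    return r / total_hours

# Hypothetical example: 20 units on test for 1000 hours, 4 exact failure times recorded
fails = [150.0, 340.0, 620.0, 910.0]
print(lambda_hat_type1(fails, n=20, T=1000.0))   # Type I: test ran its full 1000 hours
print(lambda_hat_type2(fails, n=20))             # Type II: as if the test stopped at the 4th failure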

3.7  Exponential Distribution Closure Property

The exponential distribution possesses a convenient closure property that applies to assemblies or systems made up of independent and exponentially distributed components. If the system fails upon failure of the first component, then the system life distribution is also exponential. The failure rate parameter for the system is equal to the sum of the failure rate parameters of the components.

A system model where n components operate independently and the system fails with the first component failure is called a series model. This model is discussed in Chapter 10, where it is shown that the system failure rate function hs(t) is the sum of the n component failure rate functions h1(t), h2(t), … , hn(t). When the components have exponential lifetimes with parameters λ1, λ2, … , λn, then the system has a constant failure rate equal to

\lambda_s = \sum_{i=1}^{n} \lambda_i

This result establishes the exponential closure property since a constant failure rate implies an exponential distribution. If the components are all the same, each having failure rate λ, then the system has failure rate nλ and an MTTF of 1/nλ. This exponential closure property is almost unique. In general, the minimum lifetime of n components will not have the same type life distribution as the components themselves. (There is a more general case of this closure property: when all the components have Weibull distributions with the same shape parameter. See Chapter 4.)
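A short sketch (Python; the component failure rates are illustrative values of our own) shows the closure property in action and checks it by simulating the first-failure time of a series system:

import math
import random

rates = [2.0e-5, 5.0e-5, 1.0e-4]          # illustrative component failure rates per hour
lam_system = sum(rates)                    # closure property: series-system failure rate
t = 1000.0

R_system = math.exp(-lam_system * t)       # system reliability at time t

# Monte Carlo check: system life = minimum of the independent component lives
random.seed(1)
trials = 200_000
survived = sum(
    1 for _ in range(trials)
    if min(random.expovariate(lam) for lam in rates) > t
)
print(R_system, survived / trials)         # the two survival estimates agree closely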


EXERCISE 3.7
A system consists of 20 serially connected independent components, each of which has a lifetime described by the exponential distribution with an MTTF = 1500 hours. What is the probability of the system failing in the first 100 hours of operation?

3.8  Testing Goodness of Fit: The Chi-Square Test

There is a standard statistical test, known as the chi-square (χ2) goodness-of-fit test, for deciding whether sample data are consistent with a preconceived model. Our basic, or null, hypothesis is that the model is adequate. We then pick a confidence level such as 90% or 95%. The higher the level, the stronger the negative evidence required from the sample data before we will be willing to reject our hypothesis. For example, if we set our confidence level at 90%, we are saying that we need sample data that are so unusual they would occur less than 10% of the time if our model really is correct. If we obtain such unlikely sample data, we will reject the model. If we set our confidence level at 95%, we require data so unlikely that they occur less than 5% of the time if the model is correct, before we feel comfortable rejecting the model. In other words, what we really control by picking a confidence level is the probability of making a mistake by rejecting a good model. This probability of error is known as the alpha or Type I error and has a value less than (100% − confidence level in %). (A more detailed treatment of the probabilities of making decision errors based on sample data is given in Section 11.11.)

In the chi-square goodness-of-fit test, we go through a set procedure (enumerated below) and calculate a number. If our chosen model is correct, this random quantity we have calculated ("random" because its value depends on the random sample data) will have an approximate chi-square distribution, a well-known distribution. The probability (i.e., the CDF) for the calculated number can be easily determined by the use of the spreadsheet function CHIDIST(number to evaluate, degrees of freedom). To find the percentile corresponding to a given probability, we apply the inverse function CHIINV(probability, degrees of freedom). If the calculated number turns out so large that it is a highly unlikely value, then we reject the model. Any value higher than the confidence level percentile is defined to be "highly unlikely" enough.

The above discussion may sound somewhat abstract and academic. However, it does have a very serious consequence that is not always understood or appreciated by those who use statistical hypothesis testing methods. When we perform a statistical test and end up not rejecting the assumed hypothesis, we have not proved the hypothesis is correct. The terminology "accept" the hypothesis and "90% confidence level" may mislead us into thinking we are confident our hypothesis is true. The statistical test gives us no such confidence. We started with what we thought was a reasonable model, and all we have shown is that the sample data have not changed our minds. On the other hand, when we reject, we are making a strong statement of belief. We are saying that the sample evidence was so overwhelmingly against the model we had chosen that we have to reconsider our choice.

When we use statistical software such as JMP or Minitab to carry out a statistical hypothesis test, the software will typically not ask us to choose a confidence level. Instead, most statistical software packages will give what is known as a p value. We pick a confidence level and compare this p value to .01 × (100% − confidence level in %). If the p value is smaller than this maximum allowable Type I error, called the significance level, we reject the hypothesis. Thus, the smaller the p value, the more evidence we have for rejection, and it is up to the researcher to choose the minimum amount of evidence needed. Readers who desire more background on the theory of hypothesis testing should consult a basic statistics textbook such as that by Wackerly, Mendenhall, and Scheaffer (2008).

We now turn to how to use the chi-square test to determine whether sample data are consistent with a distributional assumption such as exponential. The steps for testing goodness of fit are as follows:











1. Group the data, if necessary, into intervals as if preparing to plot a histogram. The intervals need not be equal, however. Form a table with the interval and the observed frequency, as in Table 3.3 for the memory chip data.
2. If the model is a completely specified distribution (i.e., no unknown parameters), go on to Step 3. Otherwise, estimate all the unknown parameters, using the method of maximum likelihood. In the case of the exponential distribution, we estimate λ using the method previously described. At this point, the model is completely specified.
3. Use the CDF implied by the model to calculate the probability of failure in each of the intervals listed in the table. If the interval is (I1, I2), this probability is F(I2) − F(I1). Add a column with these probabilities to the table.
4. Multiply the probabilities just calculated (which should total to 1) by the sample size on test. This calculation results in an expected number of failures for each interval. Add a column of these numbers to the table.
5. If any intervals have less than five expected failures (not actual) as just calculated, it is generally better to combine them with other intervals until every number in the expected column is five or greater.
6. Now calculate the following:

\frac{(\text{actual number of failures} - \text{expected number of failures})^2}{\text{expected number of failures}}

   for each of the remaining intervals on the table. Add these numbers together. This total is the χ2 test statistic.
7. Take one less than the number of separate intervals in the table (or the number of terms added together in Step 6, less one) and subtract the number of parameters estimated in Step 2. The number obtained is the degrees of freedom of the χ2 test statistic.
8. Determine the significance of the resulting χ2 statistic by using the spreadsheet function CHIDIST(χ2 statistic, degrees of freedom), which provides directly the p value for the calculated degrees of freedom. If the p value is less than the maximum significance level chosen, reject the model. If not, continue to use the model, but note just how close you were to rejecting it.

The grouping in Step 5 is necessary because the test statistic calculated in Step 6 has only an approximate chi-square distribution. As long as each interval has at least five expected failures, the approximation is good enough to be useful; for under five expected failures, the chi-square approximation may be inappropriate.

EXAMPLE 3.7  CHI-SQUARE GOODNESS OF FIT

Suppose we wish to test whether the memory chip failure month data in Table 3.2 are well modeled by an exponential distribution. Table 3.4 shows the worksheet obtained from applying the first four steps in calculating a chi-square test statistic. The estimate of 0.00736 obtained in Example 3.6 has been used. The probability value for the 55–110 interval will be calculated in detail to illustrate the procedure. For the exponential model, F(55) = 0.3334 and F(110) = 0.5556. Hence, in the interval 55–110, the failure probability is 0.5556 − 0.3334 = 0.2222, and we expect 22.2 failures versus the observed 25. To have 5 or more failures in every interval, we can combine intervals 275–330 and 330–385 to have an expected 7.31 versus an actual 7. We also have to combine the last remaining intervals into a single "greater than 385" interval with 1.95 + 1.30 + 0.87 + 1.63 = 5.75 expected and 4 + 0 + 3 + 1 = 8 observed. The calculation described in Step 6 becomes

\frac{(38 - 33.3)^2}{33.3} + \frac{(25 - 22.2)^2}{22.2} + \frac{(8 - 14.8)^2}{14.8} + \frac{(10 - 9.87)^2}{9.87} + \frac{(4 - 6.58)^2}{6.58} + \frac{(7 - 7.31)^2}{7.31} + \frac{(8 - 5.75)^2}{5.75}
= 0.65 + 0.35 + 3.13 + 0 + 1.01 + 0.01 + 0.88 = 6.04

This test statistic was a sum of 7 terms, and the rule for degrees of freedom given in Step 7 says to subtract 1 and subtract another 1 because λ was estimated from the data. The result is 5 degrees of freedom. We now use the spreadsheet function CHIDIST(6.04,5), which returns the value .30, or 30%, for the p value. If a significance level of .10 or 10% had been specified, we see that the resulting chi-square probability is well above this 10%, and consequently, there is no reason to reject an exponential model for the population from which the data are derived.

It should be mentioned that both Excel and OpenOffice contain a built-in spreadsheet function called CHITEST(actual range, expected range), with the two arguments covering the expected and actual arrays, which must manually be adjusted for the requirement to have greater than five in each interval. However, the degrees of freedom used by this function are one less than the number of rows. Thus, the program does not account for the degree of freedom applied in the estimation of λ, and the resulting answer will differ from the correct one shown above.

TABLE 3.4
Chi-Square Goodness-of-Fit Worksheet for the Memory Chip Data

Interval (I1, I2)   Actual Failures   P(fail in interval) F(I2) − F(I1)   Expected Failures
0–55                38                0.334                               33.4
55–110              25                0.223                               22.3
110–165              8                0.148                               14.8
165–220             10                0.099                                9.9
220–275              4                0.066                                6.6
275–330              3                0.044                                4.4
330–385              4                0.029                                2.9
385–440              4                0.019                                1.9
440–495              0                0.013                                1.3
495–550              3                0.009                                0.9
>550                 1                0.017                                1.7
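The chi-square computation of Example 3.7 can be reproduced outside a spreadsheet. The sketch below (Python, assuming the SciPy library is available; SciPy is not otherwise used in this book) works from the combined observed and expected counts and subtracts an extra degree of freedom for the estimated λ:

from scipy.stats import chi2

# Observed and expected counts after combining intervals so every expected count is >= 5
observed = [38, 25, 8, 10, 4, 7, 8]
expected = [33.3, 22.2, 14.8, 9.87, 6.58, 7.31, 5.75]

statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# degrees of freedom = (number of intervals - 1) - 1 for the estimated lambda
df = len(observed) - 1 - 1
p_value = chi2.sf(statistic, df)       # upper-tail probability, the same role as CHIDIST

print(statistic, df, p_value)          # roughly 6.0 with 5 degrees of freedom, p value near .30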


EXERCISE 3.8
50 components are placed on a life test. The recorded times to failure, in hours, are

  16.2    18.9    68.4    84.7    112
 143     154     201     212     229
 231     292     366     371     412
 439     441     490     548     561
 609     610     662     662     673
 703     836     957    1020    1045
1047    1130    1131    1137    1179
1215    1240    1275    1410    1512
1635    1651    1694    1698    1973
2301    2496    2827    3142    4966

The sample data are suspected of coming from a population of lifetimes that can be adequately modeled by an exponential distribution. Analyze the sample data as follows:

a. Construct a histogram using cell intervals of size 500 (from 0 to 5000).
b. Estimate λ and the MTTF using the actual data points.
c. Estimate the median time to fail using the population formula given in the text and the estimate just calculated for λ.
d. Test the adequateness of the exponential model using the chi-square goodness-of-fit test and a confidence (or significance) level of 90%. Use the same intervals as in part a, combining everything greater than 2000 hours into one group.
e. Suppose the test had been planned to stop upon the fifth failure and hence ended at 112 hours. What would the estimated MTTF be for this experiment? Suppose instead the test had been planned to stop at 150 hours. What MTTF estimate would have resulted?

3.9  Testing Goodness of Fit: Empirical Distribution Function Tests

The chi-square goodness-of-fit test is easy to calculate, and tables and spreadsheet functions for the chi-square distribution are readily available. Although the use of the chi-square distribution is based on large sample theory, as long as one has a reasonable number of intervals with five or more expected failures, a test rejection is good evidence that the exponential distribution (or any other model you may be fitting) does not match the sample data. However, the significance level will depend on the width and number of the binning intervals (see Exercise 3.9). Consequently, what if the chi-square test does not reject? Are there other test statistics that might be more capable of rejecting an exponential when, say, the true underlying distribution looks more like a normal or a lognormal? In other words, are there different test statistics we could apply that would be more sensitive and consistent at detecting departures from our model assumption than the chi-square procedure, at least in some practical cases?

In fact, there is a collection of powerful test procedures based on comparing the CDF for the chosen model and the estimated empirical distribution function (EDF) for the sample data (the EDF was defined in Section 1.4). When the CDF is completely specified, that is, the distribution and all parameters are known and we have complete data, EDF statistics are in general more powerful tests than chi-square. By "more powerful," we mean that these procedures are usually more sensitive and capable of rejecting the hypothesized or chosen model distribution when the chi-square test might not have enough evidence for a rejection. Even for those cases in which parameters are estimated or the data are censored, EDF statistics can be applied. Several of these procedures are built into software such as Minitab and JMP. However, these EDF tests involve extensive computation and often rely on derived percentile values that are only known for a few distributions.

EDF tests start with a sample t1, t2, … , tn that has been put into ascending order. The ordered sample is denoted by t(1), t(2), … , t(n), where the t(i) are known as order statistics. If we have previously decided that F0(t) is our choice for the population CDF the sample was taken from, we can calculate several different EDF statistics. The following formulas are provided for reference so that a reliability analyst can understand what a software package is calculating when it does a goodness-of-fit test using an EDF-based method. We will illustrate some applications in Example 3.8. The most common EDF goodness-of-fit test is the Kolmogorov–Smirnov (KS), also known as the Kolmogorov–Smirnov–Lilliefors (KSL) test, which makes use of D-statistics.

3.9.1  D-Statistics: Kolmogorov–Smirnov

D_n^+ = \max_{1 \le i \le n}\left[\frac{i}{n} - F_0(t_{(i)})\right]; \qquad D_n^- = \max_{1 \le i \le n}\left[F_0(t_{(i)}) - \frac{i-1}{n}\right]; \qquad D = \max(D_n^+, D_n^-)

Because the EDF is a step function, the Dn+ and Dn− statistics measure the maximum deviation between the model distribution F0(t) and the EDF at the tops and bottoms of steps, respectively. Thus, we see that the KSL D-statistic measures the maximum deviation between the model distribution F0(t) and the empirical CDF.

As a simple example, suppose we are interested in a simulation study requiring the generation of random numbers from a uniform distribution defined on the unit interval [0,1]. We have available a software program that should generate what are called pseudorandom numbers using a special algorithm. The program runs and generates n values. We plot the EDF and the assumed model distribution F(t). Is there any evidence that the results are not consistent with a random sample from the uniform distribution model? Figure 3.5 is a representative illustration of the D+ and D− statistics for a KS test of a random sample of n = 3 items from an assumed unit uniform distribution.

FIGURE 3.5 Illustration of Dn+ and Dn− statistics for KS test.

We mention two other EDF statistics often applied in goodness-of-fit tests.

3.9.2  W2-Statistics: Cramer–von Mises

W^2 = \sum_{i=1}^{n}\left[F_0(t_i) - \frac{i - 0.5}{n}\right]^2 + \frac{1}{12n}

1 A = − n+ n 2

n

∑ (2i − 1) ln(F (t 0

(i)

)) + (2 n + 1 − 2 i) ln(1 − F0 (t( i ) ))

i=1

The W 2 is used in tests labeled Cramer–Von Mises statistics, and A2 is used in AndersonDarling (AD) tests. The KSL tests work reasonably well against any model for F0(t), and the AD tests are especially powerful for testing when EDF departs from the true distribution F0(t) in the tails. In reliability analysis, detecting departure in the tails is generally important for goodness of fit and so A2 is the recommended statistic. The more common D-statistic is conceptually simpler to visualize and compute but can be less powerful than W2 or A2. When F0(t) has parameters estimated by the maximum likelihood method, all EDF tests are approximate (like the chi-square test), and tabled values, when known, are based on simulation studies. By modification of the EDF statistics to incorporate the sample size n, Stephens (1974) has reduced the tables of critical values to single lines. We illustrate the use of such a table in Appendix 3A (see D’Agostino and Stephens 1986 for a comprehensive treatment of goodness-of-fit tests). Note that JMP and Minitab provide the significance levels for the EDF tests they use. EXAMPLE 3.8  GOODNESS-OF-FIT TESTS BASED ON EDF STATISTICS A computer manufacturer stress tested 25 vendor components in order to see whether they had either a wear out or an early failure mechanism, or if they could be modeled successfully by an exponential distribution with the lack of memory property. The 25 failure times (in hours) were 10.9, 19, 30.5, 43.7, 47.4, 63.8, 68.9, 69.4, 81.2, 82.4, 84.6, 87.5, 92.7, 111.7, 113.5, 121.7, 122.5, 138.3, 139.8, 157.1, 178.2, 194.3, 195.3, 215.4, and 218.7. Create a histogram of the data and roughly assess whether an exponential fit appears reasonable. Create an EDF plot and illustrate the KSL D-statistics graphically. Test goodness of fit to an exponential model with JMP and Minitab.

Solution
A JMP histogram of the data is shown in Figure 3.6. An exponential fit does not seem appropriate. The EDF plot illustrating the KSL D-statistics is shown in Figure 3.7.

Figure 3.8 shows the output of Minitab's analysis of the failure times, using the Stat, Quality Tools, Individual Distribution Identification platform and selecting an exponential fit. The Minitab plot of the failure times is shown, along with the accompanying goodness-of-fit test using the AD-statistic (see Chapter 6 for the details of probability plotting). This test uses the EDF A2 formula. The p value from the test is less than .01, indicating a strong rejection of the hypothesis that an exponential model provides an adequate fit for this collection of failure times.



FIGURE 3.6 JMP histogram of test data.

FIGURE 3.7 Empirical distribution function plot and KSL D-statistics.

Figure 3.9 shows the same analysis carried out by JMP, using the Analyze, Distribution platform. JMP uses a KSL statistic to test the goodness of fit and calculates a p value of 0.0119, also indicating strong rejection of the hypothesis that an exponential model provides an adequate fit for this collection of failure times. Note that both probability plots provide strong visual evidence indicating lack of fit. Probability plots are an extremely important tool and should be used along with any analytic test for goodness of fit. The techniques for probability plotting are covered in Chapter 6.


FIGURE 3.8 Minitab exponential analysis of failure times (exponential probability plot with goodness-of-fit test: AD = 1.977, p value = .009).

FIGURE 3.9 JMP exponential analysis of failure times.
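Readers without JMP or Minitab can get a rough EDF-style analysis from general statistical libraries. The sketch below (Python with SciPy assumed) fits an exponential to the Example 3.8 failure times and computes KS-type and AD-type statistics; note that these library routines do not apply the same Lilliefors-style adjustments for an estimated parameter that JMP and Minitab use, so the reported p values and critical values are only approximate:

from scipy import stats

# Failure times from Example 3.8
times = [10.9, 19, 30.5, 43.7, 47.4, 63.8, 68.9, 69.4, 81.2, 82.4, 84.6, 87.5,
         92.7, 111.7, 113.5, 121.7, 122.5, 138.3, 139.8, 157.1, 178.2, 194.3,
         195.3, 215.4, 218.7]

mttf_hat = sum(times) / len(times)      # complete sample: MTTF estimate is the sample mean

# KS-type D statistic against an exponential with the estimated mean as scale
d_stat, ks_p = stats.kstest(times, 'expon', args=(0, mttf_hat))

# Anderson-Darling test for exponentiality (the scale parameter is estimated internally)
ad_result = stats.anderson(times, dist='expon')

print(d_stat, ks_p)
print(ad_result.statistic, ad_result.critical_values, ad_result.significance_level)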


EXERCISE 3.9
Perform a chi-square goodness-of-fit test on the data in Example 3.8 using binning intervals of width 20, that is, 0–20, 20–40, and so on. Repeat using intervals of width 25. Compare the significance levels. What does the result tell you about the choice of intervals for this test?

EXERCISE 3.10
Use the JMP Distribution platform to fit an exponential distribution to the data in Table 3.2 and test for goodness of fit using the KSL D-statistic. Alternatively, use the Minitab Individual Distribution Identification platform (in Quality Tools) to fit an exponential distribution to the same data and calculate an AD-statistic to test for goodness of fit. (Neither test will have sufficient evidence to reject an exponential model, similar to the chi-square test result.)

3.10  Confidence Bounds for λ and the MTTF

The estimate λ̂ = (number of failures)/(total unit test hours) is a single number or point estimate of λ. Although it may be the "best" estimate we can come up with, by itself it gives no measure of precision or risk. Can the true λ be as high as 10 × λ̂? Or, are we confident it is no worse than 1.2 × λ̂? How much better than λ̂ might the true λ actually be? These are very important questions if critical decisions must be made that depend on the true value of λ. No presentation of the test results or a calculation of λ̂ is complete without including an interval around λ̂ that has a high degree of confidence of enclosing the true value of λ.

This kind of interval is called a confidence interval. A 90% confidence interval means that if the same experiment was repeated many times, and the same estimation method was used over and over again to construct an interval for λ, 90% of these differing intervals would contain the true λ. For the one time we actually do the experiment, our interval either does or does not contain λ. Nevertheless, if it is a 90% confidence interval, then we would give 9 to 1 odds in its favor. Alternatively, unless a 1 in 10 chance has occurred, the 90% confidence interval will contain the true (population) value for λ.

For complete, Type I (time-censored), or Type II (rth failure-censored) data, factors based on the chi-square distribution can be derived and used as multipliers of λ̂ to obtain the upper and lower ends of a confidence interval of any size. The remarkable thing about these multiplying factors is they depend only on the number of failures observed during the test and not on the total sample size.

Suppose we want a (1 − α), in percent, confidence interval for λ, where α is the risk we are willing to accept that our interval does not contain the true value of λ. For example, α = 0.1 corresponds to a 90% interval. We will calculate a lower α/2 bound for λ and an upper (1 − α/2) bound for λ. These two numbers give the desired confidence interval. When α is 0.1 or 10%, this method sets a lower 5% bound and an upper 95% bound, having between them a 90% chance of containing λ. The lower end of this interval, denoted by λ5%, is a 95% single-sided lower bound for λ. The upper end, or λ95%, is a single-sided 95% upper bound for λ. The notation here can be confusing at first and the reader should work out several examples until it is clear. As an illustration, suppose we want a 95% interval for λ. Then, α = 0.05 and α/2 = 0.025. The interval will be (λ2.5%, λ97.5%). The lower end is a 97.5% lower bound for λ, and the upper end is a 97.5% upper bound.

Now that we have defined our notation, how do we calculate these lower and upper bounds? When we have count-censored data (Type II), it can be shown that the lower α/2 percentile of the chi-square distribution with 2r degrees of freedom (r is the number of failures), divided by 2r, is a factor we can multiply by λ̂ to get λα/2. Similarly, the upper (1 − α/2) percentile of the same chi-square, divided by 2r, is a factor we can multiply by λ̂ to get λ(1−α/2). The distribution statement corresponding to these factors is as follows: χ² = 2rλ/λ̂ has the chi-square distribution with 2r degrees of freedom (Type II censoring without replacement of failed units) and

P\left[\frac{\hat{\lambda}\,\chi^2_{2r;\,\alpha/2}}{2r} \le \lambda \le \frac{\hat{\lambda}\,\chi^2_{2r;\,(1-\alpha/2)}}{2r}\right] = 1 - \alpha

Note that this formula can also be used for complete failure data where r = n = the sample size. For the more common time-censored data (Type I), intervals using the above factors are approximately correct. Exact intervals can be calculated this way only if failed units are replaced immediately during the course of the test. In that case, the lower bound factor is exactly as above, while the upper factor uses the (1 − α/2) percentile of a chi-square with 2(r + 1) degrees of freedom, still divided by 2r. Since this chi-square factor produces a slightly more conservative upper bound, we recommend using it for Type I censoring. The probability statement is as follows:

P\left[\frac{\hat{\lambda}\,\chi^2_{2r;\,\alpha/2}}{2r} \le \lambda \le \frac{\hat{\lambda}\,\chi^2_{2(r+1);\,(1-\alpha/2)}}{2r}\right] \ge 1 - \alpha

Using this for Type I censoring without replacement gives an upper bound equivalent to assuming that another fail occurred exactly at the end of the test and that we had Type II censoring with r + 1 fails. The factors described above depend only on the number of failures r. We will denote the factor that generates an upper (1 − α/2) confidence bound limit by kr;(1−α/2). The corresponding lower α/2 confidence limit is kr;α/2. The key equation for calculating a (one-sided) upper bound is kr;(1−α) × λ̂ = λ(1−α) for a Type I censored experiment with r failures.

We can easily compute any of these k-factors using the spreadsheet function CHIINV(probability, degrees of freedom). For example, in Excel, entering the expression (CHIINV(0.05,2*(r+1)))/(2*r) calculates kr;95%, the one-sided Type I censoring 95% upper bound factor for r failures. The 95% lower bound factor kr;5% is calculated by (CHIINV(0.95,2*r))/(2*r). The expressions are the same using OpenOffice, except that the comma separator between arguments becomes a semicolon. Thus, for example, if we wanted a 90% two-sided confidence interval on the estimated failure rate λ̂ = 500 FITs based on observing three failures under time-censored testing, the upper bound would be found by evaluating CHIINV(0.05,8)/6, which results in 2.585, and the lower bound comes from CHIINV(0.95,6)/6, which gives 0.273. Consequently, the 90% confidence interval on λ is (137, 1293) FITs. Since the spreadsheet functions are readily accessible, we have not included k-factor tables in this edition. Instead, the spreadsheet functions are summarized in Table 3.5.


TABLE 3.5
Spreadsheet Functions for k-Factors for Confidence Limits on the Exponential Failure Rate

Type of Limits    Type I (Time Censored)          Type II (Count Censored or Complete Data)
Single sided
  Upper limit     CHIINV(α,2*(r+1))/(2*r)         CHIINV(α,2*r)/(2*r)
  Lower limit     CHIINV(1−α,2*r)/(2*r)           CHIINV(1−α,2*r)/(2*r)
Double sided
  Upper limit     CHIINV(α/2,2*(r+1))/(2*r)       CHIINV(α/2,2*r)/(2*r)
  Lower limit     CHIINV(1−α/2,2*r)/(2*r)         CHIINV(1−α/2,2*r)/(2*r)

EXAMPLE 3.9  CONFIDENCE BOUNDS FOR λ

Compute 60% and 90% two-sided confidence bounds for λ using the data in Exercise 3.8b. Also, calculate a 90% interval for λ for the two cases given in Exercise 3.8e.

Solution
A complete sample is "censoring Type II with r = n." For the data in Exercise 3.8, λ̂ = 100.35 %/K and r = 50. A 60% two-sided confidence interval implies that α/2 = 0.2 and 1 − α/2 = 0.8. The upper bound k-factor is obtained by entering CHIINV(0.2,100)/100 into a spreadsheet cell, which returns 1.12. So the upper confidence limit on the true failure rate λ is 1.12 × 100.35 = 112.06 %/K. The lower bound k-factor comes from CHIINV(0.8,100)/100, which gives 0.88. The lower confidence limit is, thus, 0.88 × 100.35 = 88.25 %/K. The 60% confidence interval is thus (88.25, 112.06) in %/K. Similarly, the upper 90% bound factor is found by CHIINV(0.05,100)/100, which gives 1.24, resulting in the upper 90% limit for λ of 124.78 %/K. The lower bound k-factor comes from CHIINV(0.95,100)/100, which gives 0.78, resulting in a lower confidence limit of 78.2 %/K. The 90% confidence interval is (78.2, 124.78) in %/K.

λ̂ for the first case in Exercise 3.8e is 1/1068 = 0.0009363 fails per hour and r = 5. The k-factors are evaluated by CHIINV(0.05,10)/10 and CHIINV(0.95,10)/10, which give 1.83 and 0.39, respectively. The 90% interval is (0.000369, 0.001714), given in fails per hour. λ̂ for the second case in Exercise 3.8e is 1/1173.9 = 0.0008519. The censoring is Type I with r = 6. The k-factors are found using CHIINV(0.05,14)/12 and CHIINV(0.95,12)/12, which give 1.974 and 0.44, respectively, and the 90% interval is (0.000371, 0.001681), in fails per hour.

EXERCISE 3.11
Experimental results, analyzed based on an exponential distribution of failure times assumption, give λ̂ = 200 FITs. Censoring was Type I, and there were three failures during the test. Compute a 95% two-sided confidence interval for λ.
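The k-factors of Table 3.5 can also be computed outside a spreadsheet. The sketch below (Python with SciPy assumed; chi2.isf plays the same role as CHIINV) reproduces the 90% two-sided interval for the 500 FIT, three-failure illustration given earlier in this section:

from scipy.stats import chi2

def k_factors_type1(r, alpha, two_sided=True):
    """Upper and lower k-factors for Type I (time-censored) data, following Table 3.5."""
    a = alpha / 2 if two_sided else alpha
    upper = chi2.isf(a, 2 * (r + 1)) / (2 * r)       # chi2.isf(p, df) corresponds to CHIINV(p, df)
    lower = chi2.isf(1 - a, 2 * r) / (2 * r)
    return lower, upper

lam_hat = 500.0      # estimated failure rate in FITs
r = 3                # observed failures
lower_k, upper_k = k_factors_type1(r, alpha=0.10)
print(lower_k * lam_hat, upper_k * lam_hat)          # about (137, 1293) FITs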

3.11  The Case of Zero Failures

When a test ends after time T with none of the n test units having failed, the point estimate previously defined is zero; "0" is not a realistic estimate, as it does not even take into account the number on test. An upper (1 − α), in percent, confidence limit for λ is given by

\lambda_{(1-\alpha)} = \frac{\chi^2_{2;(1-\alpha)}}{2nT} = \frac{-\ln \alpha}{nT}

where χ²2;(1−α) is the upper (1 − α) percentile of the chi-square distribution with 2 degrees of freedom. This formula can be derived from the formula for the chi-square distribution and also from a simple probability argument (see Problem 3.8 at the end of this chapter).

The 50% zero failures estimate is often used as a point estimate for λ. This should be interpreted very carefully. It is a value of λ that makes the likelihood of obtaining zero failures in the given experiment similar to the chance of getting a head when flipping a coin. We are not really 50% confident of anything; we have just picked a λ that will produce zero failures 50% of the time. Table 3.6 gives zero fail λ(1−α) formulas for several percentiles.

TABLE 3.6
Exponential Zero Failure Estimates

Percentile   Estimate
50           0.6931/nT
60           0.9163/nT
80           1.6094/nT
90           2.3026/nT
95           2.9957/nT
97.5         3.6889/nT
99           4.6052/nT

As shown in the equation above, chi-square tables are not needed since the easy formula λ(1−α) = (−ln α)/nT can be used to calculate any desired percentile. For example, assume 1000 devices from an exponential distribution are stressed for 2500 hours with no failures. At 90% confidence, we can state that an upper bound estimate on the failure rate is λ90 = (−ln 0.10)/(1000 × 2500) = 921 FITs. The interpretation of the results is as follows: If a random sample of 1000 devices from a population having an exponential distribution with a failure rate as large as 921 FITs were stressed for 2500 hours, there is only a 10% probability of zero failures. Therefore, it is unlikely (at the 90% confidence level) that the true failure rate is as large as 921 FITs.

EXAMPLE 3.10  ZERO FAILURES ESTIMATION

200 samples from a population of units believed to have a constant failure rate are put on test for 2000 hours. At that time, having observed no failures, the experimenter puts an additional 200 units on test. Three thousand hours later, there were still no failures, and all the units were removed from test. What is a 50% estimate of the failure rate? A 95% upper bound? Use the general zero fails formula to calculate a 70% upper bound on the true failure rate.

Solution
The total unit test time is 200 × 5000 plus 200 × 3000 or 1,600,000 hours. The 50% estimate is 0.6931/1,600,000, which is 433 FITs. The 95% failure rate upper bound is 2.9957/1,600,000, or 1,872 FITs. For a 70% upper bound, α = 0.3, and the general formula yields λ70% = (−ln 0.3)/1,600,000 = 752 FITs.

So far, all upper and lower bounds have been for the failure rate and not for the MTTF. But since the MTTF is 1/λ, we can work with λ and bounds on λ, and then take reciprocals to convert to MTTF estimates with bounds. Note that the reciprocal of the upper bound for λ becomes the lower bound for MTTF.

EXAMPLE 3.11  CONFIDENCE BOUNDS ON MTTF

200 units were tested 5000 hours with four failures occurring at 820, 990, 2130, and 4060 hours. Give 95% upper and lower (two-sided) bounds on the failure rate and the MTTF. What difference would it make if the test was designed to end at the fourth failure?


Solution
The λ̂ estimate is 4/988,000 = 0.405 %/K. Using the spreadsheet functions from Table 3.5 with r = 4 and α = 5%, we find the upper bound factor of 2.56 and the lower bound factor of 0.272. After multiplying by these factors, we obtain λ2.5 = 0.110 %/K and λ97.5 = 1.037 %/K. Taken together, these form a 95% confidence interval for λ. The 2.5% lower bound on the MTTF is 1/λ97.5 = 96,469 hours. The 97.5% upper bound on the MTTF is 1/λ2.5 = 760,709 hours.

If the data were Type II censored at the fourth fail, the λ̂ estimate becomes 4/803,760 = 0.498 %/K and we would have the same k-factor for the lower bound, resulting in λ2.5 = 0.136 %/K. The k-factor to obtain the upper bound would be 2.19, giving λ97.5 = 1.091 %/K. The 95% confidence bounds on MTTF would be (91677, 737486) hours.

EXERCISE 3.12
Suppose 500 units are tested for 1000 hours with no fails. Estimate the 75% upper bound for λ in FITs using the general formula. Also, calculate the 50% estimate. Interpret these two estimates.

EXERCISE 3.13
Using the 50% estimate for λ in Exercise 3.12, estimate the CDF for the exponential distribution at t = 1000 hours. Also, estimate the probability of zero failures among 500 units at 1000 hours for an exponential distribution with this failure rate. Repeat using the 75% estimate. What do you conclude?
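The zero-failures bound is a one-line calculation in any language. The sketch below (Python) reproduces the estimates of Example 3.10, with the conversion to FITs (failures per 10⁹ device-hours) done explicitly:

import math

total_unit_hours = 200 * 5000 + 200 * 3000      # 1,600,000 unit test hours, as in Example 3.10

def zero_fail_upper_bound(confidence, unit_hours):
    """Upper bound on lambda (per hour) when zero failures are observed."""
    alpha = 1 - confidence
    return -math.log(alpha) / unit_hours

for conf in (0.50, 0.70, 0.95):
    lam = zero_fail_upper_bound(conf, total_unit_hours)
    print(conf, lam * 1e9, "FITs")    # prints roughly 433, 752, and 1872 FITs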

We end this section by looking briefly at how JMP and Minitab estimate the exponential MTTF and compute confidence bounds. Both programs have Reliability/Survival platforms that allow the user to input multicensored failure data and get parameter estimates with confidence bounds. JMP provides maximum likelihood estimates by default. Minitab requires selecting the maximum likelihood estimation method. Examples using these platforms will be shown in Chapters 3, 4, and 5. For exact times of failure with either Type I or Type II censoring, JMP and Minitab will compute the same MTTF estimates as were found by the simple formulas shown in Section 3.6. However, both JMP and Minitab will give confidence bound limits that are different from each other, and neither will agree with those we obtained from the spreadsheet functions. The reason is as follows: Both JMP and Minitab use (different) large sample approximate confidence bound calculation methods that will be nearly in agreement with the methods shown in this text when there are a large number of failures. However, for small numbers of failures (say less than 15 or 20), the intervals given by JMP and Minitab will be optimistically narrow, and the wider intervals produced by the factors in Table 3.5 are more realistic. In addition, to handle the zero failure case, the specific equation must be set up using the calculator in Minitab or the formula editor in JMP.

3.12  Planning Experiments Using the Exponential Distribution

Proper experimental planning, or good experimental design, is acknowledged to be one of the most important ingredients of successful experimentation. Unfortunately, design is often neglected in the case of reliability testing and modeling. The complexity of censored data and the difficult forms of many of the life distributions used in reliability analysis present problems in experimental design that are hard to overcome in most typical applications.


This planning difficulty is not present when using the exponential distribution. Here we can, and should, give early consideration to the sample sizes and test durations of the experiment. If we carefully state our objectives, we can plan the right experiment to achieve them. In most cases, only a simple determination of the right kr,1−α factor from Section 3.10 is necessary. Spreadsheet methods based on the binomial distribution (Chapter 11) can also be used to obtain exact test planning solutions. The application of spreadsheet functions along with the built-in Goal Seek routine is presented in Appendix 3A.

Case I. How Many Units Should Be Put on Test

The following items must be specified before a sample size can be chosen:

1. A failure rate or MTTF objective
2. A confidence level for ensuring we meet this objective
3. A test duration
4. A somewhat arbitrary and predetermined number of failures that we want to allow to occur during the test and still meet the objective

With all these items specified, we know all the terms in the equation



\frac{r}{(n - r)t + \frac{rt}{2}} \times k_{r;1-\alpha} = \lambda_{\text{obj}}

except n (the denominator is only an approximation of the total unit test hours for planning purposes—it assumes that all the r failures occur in the middle of the test time). Solving this equation for n gives the required sample size.

EXAMPLE 3.12  CHOOSING SAMPLE SIZES

We wish to be 90% confident of meeting a 0.2 %/K specification (MTTF = 500 K hours). We can run a test for 5000 hours, and we agree to allow up to 5 failures and still pass the product. What sample size is needed?

Solution Using the single-sided spreadsheet function CHIINV(1−0.9,2*(5+1))/(2*5) for a time-censored experiment (Table 3.5), we obtain the k5;90% factor 1.854935. The basic equation is



5   × 1854935 . = 0.000002 5000(n − 5) + 5 × 2500 And solving for n, we get n = 930. EXERCISE 3.14 Suppose we want to be 90% confident of meeting an MTTF of at least 2,000,000 hours (in other words, a failure rate of 500 FITs). We can test for 2000 hours, and we want to allow


for up to two fails. What sample size do we need? What is the sample size if we only want 60% confidence? What if we allow only one fail and still want 60% confidence?
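As a supplementary illustration (not part of the original text), the Example 3.12 calculation can be sketched in Python with the scipy library standing in for the CHIINV spreadsheet function; the function and variable names below are ours, not the book's.

# Sketch, assuming Python with scipy: k-factor sample-size planning (Example 3.12).
from scipy.stats import chi2

def sample_size(mttf_obj, test_hours, r, conf):
    # k_{r;1-alpha} for a time-censored test, as in Table 3.5
    k = chi2.ppf(conf, 2 * (r + 1)) / (2 * r)
    lam_obj = 1.0 / mttf_obj
    # planning equation: r*k / [(n - r)*t + r*t/2] = lam_obj, solved for n
    n = (r * k / lam_obj - r * test_hours / 2) / test_hours + r
    return k, n

k, n = sample_size(mttf_obj=500_000, test_hours=5000, r=5, conf=0.90)
print(round(k, 6), round(n))   # approximately 1.854935 and 930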

Variation: How long must the test run? Here, we have a fixed number of test units and have specified the failure rate objective, confidence level, and number of allowable failures. The only unknown remaining in the basic equation is the test time. EXAMPLE 3.13  CHOOSING THE TEST TIMES We have 100 units to test, and we want to be 95% confident that the MTTF is greater than 20,000 hours. We will allow up to 10 failures. How long must the test run?

Solution The k10;95% factor, from the spreadsheet function CHIINV(1 − 0.95,2*(10 + 1))/(2*10), is 1.696. Solving for T in



\[
\frac{10}{(90 \times T) + 5T} \times 1.696222 = 0.00005
\]

results in a test time of 3571 hours.

EXERCISE 3.15

We have 300 units to test, and we want to be 80% confident that the failure rate is less than 1000 FITs. If we allow up to four fails, how long must we test? What if we allow only one failure and reduce our confidence level to 60%?
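For comparison (again an assumed Python/scipy sketch, not the book's spreadsheet route), the same planning equation can be solved for the test time of Example 3.13.

# Sketch, assuming Python with scipy: solving the planning equation for T.
from scipy.stats import chi2

n, r, conf, mttf_obj = 100, 10, 0.95, 20_000
lam_obj = 1.0 / mttf_obj
k = chi2.ppf(conf, 2 * (r + 1)) / (2 * r)          # k_{10;95%}, about 1.696
# r*k / [(n - r)*T + r*T/2] = lam_obj  =>  T = r*k / (lam_obj * (n - r + r/2))
T = r * k / (lam_obj * (n - r + r / 2))
print(round(k, 3), round(T))                       # about 1.696 and 3571 hours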

Case II. With a Test Plan in Place, How Many Fails Can Be Allowed?

It is good practice to clearly state the pass/fail criterion of a test in advance and have all interested parties agree on it. After the experiment is run, it is much more difficult to obtain such agreements. As before, a failure rate or MTTF objective must be stated, along with the confidence level. The test plan will fix the sample size and test length. From the basic equation, we can solve for the product r × kr;1−α, but to determine kr;1−α we need r, which is unknown. We can, however, easily set up three columns in a spreadsheet: one for r, one for evaluating kr;1−α using CHIINV(1−α,2*(r+1))/(2*r), and one for the product r × kr;1−α. The first value of r that produces an r × kr;1−α product greater than our target value is one higher than the desired pass criterion.

EXAMPLE 3.14  CHOOSING PASS/FAIL CRITERIA

What is the maximum number of failures we can allow in order to be 80% confident of a failure rate no higher than 5 %/K if 50 units are to be tested for 2000 hours?

Solution The basic equation is [r/{(50 × 2000) − 1000r}] × kr;1−α = 0.00005. From this, we obtain r × (kr;1−α + 0.05) = 5. Figure 3.10 illustrates the spreadsheet columns. The second column uses CHIINV(0.2,2*(r+1))/(2*r) to provide the kr;80% values. The product r(k + 0.05), for r = 3, is greater than 5. Therefore, we subtract 1 from 3 and come up with a pass criterion of up to two failures on the test.


r    k       r(k + 0.05)
1    2.99    3.04
2    2.14    4.38
3    1.84    5.67
4    1.68    6.92
5    1.58    8.16

FIGURE 3.10 Spreadsheet columns for evaluating the product r × kr;1−α.
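The spreadsheet columns of Figure 3.10 can also be reproduced programmatically; the short loop below is an assumed Python/scipy sketch, not the authors' spreadsheet.

# Sketch, assuming Python with scipy: regenerate the r, k, r(k + 0.05) columns.
from scipy.stats import chi2

alpha, target = 0.20, 5.0
for r in range(1, 6):
    k = chi2.ppf(1 - alpha, 2 * (r + 1)) / (2 * r)
    print(r, round(k, 2), round(r * (k + 0.05), 2))
# r = 3 is the first value whose product exceeds 5, so up to 2 failures can pass.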

Case III. What Are the Minimum Test Sample Sizes We Can Use?

As before, we have to specify a failure rate objective, confidence level, and test duration. For a minimum sample size, we anticipate the best possible outcome, namely zero failures. This choice means we are prepared to state that the product has not demonstrated the specified failure rate at the required confidence level if we see even one fail. Minimum sample sizes are derived by setting k0;1−α/(nT) = λobj, where k0;1−α is the zero failures factor $\chi^2_{2;(1-\alpha)}/2 = -\ln \alpha$ (see Section 3.11). Solving for n gives

\[
n = \frac{-\ln \alpha}{\lambda_{\mathrm{obj}}\, T}
\]

EXAMPLE 3.15  MINIMUM SAMPLE SIZES We want a minimum sample size that will allow us to verify a 40,000-hour MTTF with 90% confidence, given the test can last 8000 hours.

Solution

The k0;90% factor is 2.3026 (from CHIINV(0.1,2)/2 or −ln(0.1)). Therefore, n = (40,000 × 2.3026)/8000 = 12 (after rounding up).

EXERCISE 3.16

What is the minimum sample size that will allow us to verify a 500,000-hour MTTF with 85% confidence, given that the test can run for 2500 hours?
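A one-line version of the zero-failure sample-size formula, written as an assumed Python sketch (the function name is ours), reproduces the Example 3.15 result.

# Sketch, assuming Python: minimum (zero-failure) sample size n = -ln(alpha)/(lambda_obj*T).
import math

def min_sample_size(mttf_obj, test_hours, conf):
    alpha = 1 - conf
    return math.ceil(-math.log(alpha) / ((1 / mttf_obj) * test_hours))

print(min_sample_size(mttf_obj=40_000, test_hours=8000, conf=0.90))   # 12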

Variation: How do we determine minimum testing times? The number of test units, as well as the failure rate objective and the confidence level, are fixed in advance. The choice of T then becomes

\[
T = \frac{k_{0;1-\alpha}}{n\,\lambda_{\mathrm{obj}}} = \frac{-\ln \alpha}{n\,\lambda_{\mathrm{obj}}}
\]

As before, if one fail occurs when the test is run, the failure rate objective will not be confirmed at the desired confidence level. EXAMPLE 3.16  MINIMUM TEST TIMES The failure rate objective is the very low number 10 PPM/K. We want to confirm this at an 80% confidence level. The component is an inexpensive resistor, and we plan to test 10,000 of them. How long should the test run?


Solution

By substitution, T = 1.6094/(10,000 × 10 × 10^−9) = 16,094 hours.

The test period in Example 3.16 is nearly 2 years of continuous testing and might be much too long to be practical, even though it is a “minimum” test time. This example illustrates the difficulties inherent in verifying very high levels of reliability. Since the trend is toward more and more failure rate objectives in the PPM/K range, reliability analysts will face this problem with increasing frequency. The better we make our components, the harder it becomes to assess their actual performance. One way out of this quandary is to test at high levels of stress, accelerating failure times as compared to what would happen at actual use conditions. We already saw this concept employed (but not mathematically explained) with the memory chip equivalent month of failure data (Example 3.3). A full discussion of acceleration modeling is in Chapter 8. At this point, we just note that if the test in Example 3.16 could be carried out in a test condition that accelerates failure times by a factor of 10 (i.e., a 10 × acceleration factor), then only 1609 test hours would be required, or under 10 weeks.

EXERCISE 3.17

Suppose we want to confirm a 100-FIT rate at 90% confidence. We have 2000 components to test. What is the minimum test time? What if a 60% confidence limit is used?

Using a minimum sampling plan will generally save around 40% in sample size or test duration compared to allowing just one fail. On the other hand, deciding a product will not meet its objective based on a single fail leaves no margin for the one odd defective unit that might slip into the sample. It may turn out better to allow for a few failures and pay the extra sample size price rather than have a test result that many will not accept as valid. Use of minimum sample sizes makes sense if we are very confident that the product is much better than the objective, or we think it is so much worse that we will see many failures, even with the minimum sample.

3.13  Simulating Exponential Random Variables

A general method of simulating any random variable starting with uniformly distributed random variables was given in Chapter 1. First, the inverse CDF F^−1 is derived. Then, a uniform variable U is substituted for F in this inverse equation. For the exponential, we have

\[
F(t) = 1 - e^{-\lambda t} = U, \qquad F^{-1}(U) = t = \frac{-\ln(1 - U)}{\lambda}
\]

Since both U and 1 − U have the same uniform distribution over the unit interval (0,1), for purposes of simulation, this expression is equivalent to

\[
t = \frac{-\ln(U)}{\lambda}
\]

By substituting n random uniformly distributed numbers into the last equation, a random sample of size n from the exponential is generated. For example, we can use the spreadsheet function RAND() to generate pseudorandom uniform numbers in the interval [0, 1].


EXAMPLE 3.17  SIMULATING EXPONENTIAL DATA

We will simulate 25 random failure times from an exponential distribution with MTTF = 3000 hours. The first step is to use a random number table, a hand calculator with a built-in random number generator, or a spreadsheet random number generator to generate uniform random numbers from the unit interval [0,1]. We used a spreadsheet RAND() program to obtain

0.49099  0.76521  0.82923  0.00794  0.82923
0.75458  0.4535   0.9044   0.17498  0.03049
0.14665  0.92835  0.45408  0.77664  0.73708
0.77058  0.91914  0.14085  0.12525  0.87884
0.3722   0.21737  0.57361  0.30474  0.63097

The next step is to put these in descending order starting with 0.92835, 0.91914, 0.9044, and so on. Next, transform each of these to random exponential data by using the formula t = (−ln U) × MTTF. The first failure time is t1 = (−ln 0.92835) × 3000 = 223 hours. The second is 253, and so on (it is left to the reader to calculate the remaining 23 failure times).

EXERCISE 3.18

Take the first 20 random failure times generated in Example 3.17 and treat them as if they were obtained from a censored Type II life test with 25 units on test and r = 20. Estimate the MTTF and give a 90% confidence interval for it.
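The same inverse-CDF simulation can be written outside a spreadsheet; the following is an assumed Python/numpy sketch (the seed is arbitrary and not from the text), so the generated values will differ from the RAND() numbers above.

# Sketch, assuming Python with numpy: simulate exponential failure times, t = -ln(U)*MTTF.
import numpy as np

rng = np.random.default_rng(seed=1)     # arbitrary seed for repeatability
mttf = 3000
u = rng.uniform(size=25)                # uniform numbers, the RAND() analog
t = np.sort(-np.log(u) * mttf)          # ordered exponential failure times
print(np.round(t[:5]))                  # first five simulated failure times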

3.14  The Two-Parameter Exponential Distribution

In some situations, it may be impossible for failures to occur before the end of a waiting period of length μ hours. This period is sometimes called the threshold time, and μ is a location or threshold parameter. Assume that after that waiting period, an exponential model with parameter λ is a good model for the population failure times. The distribution model for this population has the two-parameter exponential density function given by

\[
f(t) = \lambda e^{-\lambda (t - \mu)}, \qquad \mu \le t < \infty
\]

[Minitab distribution-fit output not reproduced here; the goodness-of-fit p values discussed below were >0.250 for the two-parameter exponential fit and 0.000 for the one-parameter fit.]

Within round off, Minitab agrees with the direct calculations. More interestingly, the Minitab goodness-of-fit p values clearly reject a one-parameter fit but have no problem with the two-parameter exponential fit. The LRT p value is a likelihood ratio test (explained in Section 8.4) of whether a two-parameter exponential offers significant improvement as opposed to a one-parameter exponential fit, and the p value indicates that the improvement gained by adding the second parameter was very highly significant (the p value rounds off to 0).

3.15  Summary

The exponential life distribution is defined by F(t) = 1 − e^−λt. It is characterized by being the only distribution to have a constant failure rate. This constant failure rate has the same value as the one unknown parameter λ, which is also the reciprocal of the MTTF. Another important characteristic of the exponential is its lack of memory. When failures seem to occur randomly at a steady rate, with no significant wear out or degradation mechanisms, then the exponential will be a good model. It is appropriate for the long flat portion of the popular bathtub curve for failure rates. It is also useful when we want to verify an AFR over an interval, and the time pattern of fails over the interval is of little concern. The key formulas and properties of the exponential distribution are summarized in Table 3.7.

The best estimate of λ from censored Type I or Type II data is the number of fails divided by the total hours on test of all units. For readout data, the total hours may not be known exactly. Even in this case, however, the total unit test hours can often be estimated with little loss of precision. When λ has been estimated from censored Type I or II data, there are factors that depend only on the number of fails that can be used to obtain upper and lower bounds on λ. These


TABLE 3.7
Summary of Exponential Distribution Properties

Name | Value or Definition
CDF F(t) | 1 − e^−λt
Reliability R(t) | e^−λt
PDF f(t) | λe^−λt
Failure rate h(t) or renewal rate | λ
Average failure rate (t1, t2) | λ
Mean E(t) or mean time to fail (also mean time between failures) | 1/λ
Variance V(t) | 1/λ²
Median | 0.693/λ
Mode | 0
Lack of memory property | The probability a component fails in the next t2 hours, after operating successfully for t1 hours, is the same as the probability of a new component failing in its first t2 hours
Closure property | A system of n independent components, each exponentially distributed with parameters λ1, λ2, …, λn, respectively, has an exponential distribution for the time to first failure, with parameter λs = λ1 + λ2 + … + λn

factors are easily determined using the CHIINV spreadsheet function for the chi-square distribution and are given in Table 3.5. Table 3.6 has formulas for upper bounds when there are no fails on the test. The factors from Tables 3.5 and 3.6 can also be very useful in the planning stages of an experiment. Sample sizes, or test durations, can be calculated once the failure rate objective and a confidence level are specified. The zero fail formulas yield minimum sample sizes. Even when the unit to be tested does not have an exponential distribution, the numbers obtained from these tables are useful for rough planning. When data are available, a histogram plotted on the same graph as the estimated PDF f(t) (properly scaled up) gives a good visual check on the adequacy of using an exponential model. In addition, either the chi-square goodness-of-fit test or one of the tests based on the EDF can be used to do an analytic check for the fit. If there is a threshold time before which failures cannot occur, followed by a constant failure rate, then the two-parameter exponential model can be used.

Appendix 3A

3.1A  Test Planning via Spreadsheet Functions

Determining the Sample Size

We would like to determine the sample size n such that the probability of r or fewer failures occurring for a specified test time t and failure rate objective λobj is no greater than the α value, where the confidence level = 1 − α (in percent). That is, we want to find


n such that P(X ≤ r) = α, where X is the random variable for the number of failures. This probability can be determined using the binomial distribution (see Chapter 11) with p = F(t) = 1 − e^(−λobj t). Thus,

\[
\alpha = P(X \le r) = \sum_{i=0}^{r} \binom{n}{i}\,[F(t)]^{i}\,[1 - F(t)]^{n-i}
\]

The spreadsheet function for the binomial distribution is BINOMDIST, with four arguments: the number of allowed failures r, the sample size n, the probability of failure p at time t, and “1” or TRUE for the cumulative binomial result. For any desired F(t), number of failures r, and confidence level 1 − α, we can solve for n in BINOMDIST(r, n, F(t), 1) by trial and error. We adjust n until the function equals α. However, we can easily apply the spreadsheet Goal Seek routine to facilitate a solution, as shown in the example below. Note the result may be noninteger, but r and n are truncated to integers in BINOMDIST.

EXAMPLE 3.1A  DETERMINING THE SAMPLE SIZE USING GOAL SEEK (EXAMPLE 3.12 REVISITED)

We wish to be 90% confident of meeting a 0.2 %/K specification (MTTF = 500 K hours). We can run a test for 5000 hours, and we agree to allow up to five failures and still pass the product. What sample size is needed?

Spreadsheet Solution

Set up two columns in a spreadsheet as shown in Figure 3.1A. Column A describes the corresponding input in column B. Enter the target MTTF and the test time in hours into cells B3 and B4, respectively. In cell B5, evaluate the exponential CDF for the MTTF objective and the test time by entering the formula = 1 − exp(−(B4/B3)), or equivalently = 1 − exp(−5000/500000), which returns the result 0.00995. Into cell B6, enter the number of allowed failures, that is, five. Enter a trial guess for the sample size, for example, 100, into cell B7. Then, enter the function BINOMDIST(B6,B7,B5,1) into cell B8, which returns the result 0.999479. Under the Tools menu, select Goal Seek. When the dialog box appears, make sure the Set (Formula) cell is B8, the To (target) value is set to α, that is, 0.1, and the By changing (variable) cell is set to B7. Hitting the OK button produces the result 930.3673 in cell B7, or 930 truncated. The Goal Seek dialog box labels vary between Excel and OpenOffice, and the final convergence can result in slightly different values between the two programs due to differences in the search routines or the starting values you choose. Note that while the Goal Seek routine may converge closely to the final exact number, it may be necessary to fine-tune the result manually to get an even more precise fit to the desired alpha value. In this case, n = 930 is the best value.

FIGURE 3.1A Spreadsheet entries to determine sample size.
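The Goal Seek search of Example 3.1A can be checked with an assumed Python/scipy sketch (not from the text); the binom.cdf call plays the role of BINOMDIST.

# Sketch, assuming Python with scipy: cumulative binomial probability near n = 930.
import math
from scipy.stats import binom

mttf_obj, t, r, alpha = 500_000, 5000, 5, 0.10
p = 1 - math.exp(-t / mttf_obj)                # F(t), about 0.00995
for n in (929, 930, 931):
    print(n, round(binom.cdf(r, n, p), 4))
# The probability crosses alpha = 0.10 near n = 930, consistent with Goal Seek's 930.37.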


EXERCISE 3.1A  (EXERCISE 3.14 REVISITED) Suppose we want to be 90% confident of meeting an MTTF of at least 2,000,000 hours (in other words, a failure rate of 500 FITs). We can test for 2000 hours, and we want to allow for up to two fails. Using the spreadsheet Goal Seek routine, what sample size do we need? Try manually adjusting the result to see if we can obtain a closer match to the alpha value. What is the sample size if we only want 60% confidence? What if we allow only one fail and still want 60% confidence? Compare results to answers for Exercise 3.14.

Determining the Test Length Using Spreadsheet Functions Consider again the spreadsheet function BINOMDIST and the four arguments: the number of allowed failures r, the sample size n, the probability of failure p at time t, and “1” or TRUE for the cumulative binomial result. For any sample size n, number of failures r, and confidence level 1 − α, we can adjust the time value t by trial and error in the exponential formula for F(t) in BINOMDIST (r, n, F(t), 1) until the function equals α. Again, we can apply the spreadsheet Goal Seek routine to facilitate a solution as shown in Example 3.2A. EXAMPLE 3.2A  CHOOSING THE TEST TIMES We have 100 units to test, and we want to be 95% confident that the MTTF is greater than 20,000 hours. We will allow up to 10 failures. How long must the test run?

Spreadsheet Solution

Set up two columns in a spreadsheet as shown in Figure 3.2A. Into cell B3 enter the target MTTF, and into cell B4, enter a trial test time, say 1000. In cell B5, evaluate the exponential CDF for the MTTF objective and the trial test time by entering the formula = 1 − exp(−(B4/B3)), which returns the result 0.048771. Into cell B6, we specify the number of allowed failures, that is, 10. The sample size of 100 is entered into cell B7. Type the formula BINOMDIST(B6,B7,B5,1) into cell B8, which returns the result 0.990351. Under the Tools menu, select Goal Seek. When the dialog box appears, make sure the Set (Formula) cell is B8, the To (target) value is set to α, that is, 0.05, and the By changing (variable) cell is set to B4. Hitting the OK button produces the result 3573.798 in cell B4, or 3574 hours, rounded up. As mentioned before, manual fine-tuning may be needed to get an even more precise fit to the desired alpha value. In this case, we confirm that a slight adjustment to 3576 hours provides the best fit.

EXERCISE 3.2A  (EXERCISE 3.15 REVISITED)

We have 300 units to test, and we want to be 80% confident that the failure rate is less than 1000 FITs. If we allow up to four fails, use the spreadsheet Goal Seek routine to

FIGURE 3.2A Spreadsheet entries to determine test length.
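A root finder gives the same answer as the Goal Seek search of Example 3.2A; the following is an assumed Python/scipy sketch with a search bracket we chose ourselves.

# Sketch, assuming Python with scipy: solve for the test time where BINOMDIST equals alpha.
import math
from scipy.stats import binom
from scipy.optimize import brentq

n, r, mttf_obj, alpha = 100, 10, 20_000, 0.05

def gap(t):
    p = 1 - math.exp(-t / mttf_obj)
    return binom.cdf(r, n, p) - alpha

t_req = brentq(gap, 100, 50_000)    # bracket chosen wide enough to contain the root
print(round(t_req))                 # roughly 3570 to 3580 hours, as in the example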


determine how long we should test. What if we allow only one fail and reduce our confidence level to 60%? Try manually adjusting the results for both cases to see if we can obtain a closer match to the alpha value.

Determining the Number of Allowed Failures via Spreadsheet Functions

As before, we can adjust the number of allowed failures in the spreadsheet function BINOMDIST(r, n, F(t), 1) until the function equals α. We can apply the spreadsheet Goal Seek routine to obtain a solution, but since the number of allowed failures is usually very small, simple trial and error can produce the desired result quickly. Also, because of the discrete nature of the allowed number of failures, achieving an exact fit to the alpha value will normally not occur. We choose the largest value of r for which the cumulative probability still does not exceed the desired alpha risk.

EXAMPLE 3.3A  CHOOSING PASS/FAIL CRITERIA

What is the maximum number of failures we can allow in order to be 80% confident of a failure rate no higher than 5 %/K if 50 units are to be tested for 2000 hours?

Spreadsheet Solution Set up two columns in a spreadsheet as shown in Figure 3.3A. Use the spreadsheet functions as previously explained for Figures 3.1A and 3.2A. Start with 0 in cell B6. The result in cell B8 is 0.006738. We try the values 1, 2, and 3 in cell B6 and obtain the respective numbers 0.04217, 0.133466, and 0.287095 in cell B8. We see immediately that the number of allowed failures cannot exceed 2 since the alpha level of 0.2 is exceeded for the value 3.

3.2A  EDF Goodness-of-Fit Tests Using Spreadsheets

KS Test

We have shown in Section 3.9 how EDF goodness-of-fit tests can be done using Minitab and JMP. However, spreadsheets can also be used to do such tests, including the use of simulation to obtain probability estimates for the particular statistic. We will discuss two important scenarios in testing whether sample data fits an assumed CDF: (1) the population CDF is completely known, and (2) the CDF has one or more parameters estimated from the sample. The analysis of censored sample data is covered in Chapters 6 and 7. We will illustrate the approach via an example.

FIGURE 3.3A Spreadsheet entries to determine number of failures allowed.
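Because r is small, the trial-and-error scan of Example 3.3A is easy to reproduce; this is an assumed Python/scipy sketch rather than the spreadsheet itself.

# Sketch, assuming Python with scipy: scan r for the pass/fail criterion of Example 3.3A.
import math
from scipy.stats import binom

n, t, mttf_obj, alpha = 50, 2000, 20_000, 0.20   # 5 %/K corresponds to MTTF = 20,000 hours
p = 1 - math.exp(-t / mttf_obj)                  # F(2000), about 0.0952
for r in range(4):
    print(r, round(binom.cdf(r, n, p), 6))
# r = 3 exceeds alpha = 0.20, so at most 2 failures can be allowed.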


EXAMPLE 3.4A KS TEST Ten components are placed in a radiation chamber and exposed to a specific dosage. The recorded failure times in hours are 1.4, 2.3, 8.8, 14, 15.4, 27.4, 34.3, 76.8, 217.5, and 390.8. The manufacturer claims the failure distribution—under the stress conditions applied—is exponential with an MTTF of 100 hours. Using the KS test, determine if the data support the manufacturer’s claim. Repeat the test, but this time use the estimated MTTF from the data to test for an exponential distribution.

Solution

Set up the spreadsheet as shown in Figure 3.4A. In column A, enter the failure counts i = 1, 2, 3, …, 10 into cells A3 through A12. In column B, starting in cell B3, enter the ordered failure times t(i) = 1.4, 2.3, … , 390.8. For the EDF based on a sample of size 10, in column C, starting with cell C3, enter = A3/10 and then use the autofill feature to complete column C up to cell C12. In cell D3, we enter the formula for the exponential distribution for F(t), with MTTF = 100, that is, 1−EXP(−B3/100). B3 refers to the cell with the first time to failure of 1.4 hours. Then, again use the autofill feature to fill in the remaining cells up to D12. In cell E3, we enter the formula = ABS(C3−D3) and autofill up to E12, where ABS is the spreadsheet function for absolute value. In cell F3, we enter the formula = ABS(D3−(A3−1)/10) and again autofill. Finally, in cell F13, we find the D-statistic through the formula MAX(E3:F12). The result is D = 0.409755. We graph the EDF and the specified exponential CDF model with MTTF = 100, along with the D-statistic, as shown in Figure 3.5A.

There are several options to determine the significance level of the resulting D value. We could reference a table of percentage points of the Kolmogorov statistics (see Miller 1956) available in many books on nonparametric statistics. We could use simulation as described in Ross (2006). Zhang and Wu (2002) have developed a beta approximation to the distribution of the KS statistic. However, by application of a simple formula that incorporates the sample size n, D’Agostino and Stephens (1986) have provided a single line of significance points for what is called the modified T* statistic for various EDF statistics. The formula D* for the D-statistic when F(t) is completely known is



\[
D^{*} = D\left(\sqrt{n} + 0.12 + 0.11/\sqrt{n}\right)
\]

Here, D* = 0.409755(√10 + 0.12 + 0.11/√10) = 1.359. The D* critical values are shown in Table 3.1A. Comparing D* to the tabulated values, we see that the sample D* statistic is so large

FIGURE 3.4A Spreadsheet entries for KS goodness-of-fit test.


FIGURE 3.5A Empirical distribution function and exponential CDF model (mean time to fail = 100). [Plot of the EDF (i/n) and the exponential F(t) against failure time, with the D+ and D− deviations indicated.]

TABLE 3.1A
Percentage Points for Modified Kolmogorov D*-Statistics for F(t) Known

F(t) Completely Known
Significance level α:     25%     15%     10%     5%      2.5%    1.0%    0.50%   0.10%
Percentage points D*:     1.019   1.138   1.224   1.358   1.48    1.628   1.731   1.95

Source: D’Agostino, R. B., and M. A. Stephens, eds. 1986. Goodness-of-Fit Techniques. New York: Marcel Dekker.

that less than 5% of the time it would be exceeded by chance alone if the sample failure times actually derived from an exponential distribution with MTTF = 100 hours. Consequently, the manufacturer’s claim is rejected.

Consider now the situation where we want to perform an EDF goodness-of-fit test for the exponential distribution but the MTTF is estimated from the sample data. In this case, the MTTF estimate can be obtained by using the spreadsheet function AVERAGE(B3:B12) on the time to failure data in column B, with the result MTTF = 78.86. We proceed to test for goodness of fit as before, except now the test is against the exponential distribution with the estimated MTTF. We leave it as an exercise to show that the D-statistic now becomes 0.347437. This value can be compared to the table of critical values, such as Lilliefors (1969). However, we instead use the modified D*-statistic from D’Agostino and Stephens (1986), obtained from the formula



\[
D^{*} = \left(D - 0.2/n\right)\left(\sqrt{n} + 0.26 + 0.5/\sqrt{n}\right)
\]

In this case, D* = 1.172, which we compare to Table 3.2A. Again, we see that D* exceeds the critical value 1.094 at the 5% level of significance, causing us to reject the exponential model.
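For readers working outside a spreadsheet, the D and modified D* statistics of Example 3.4A can be recomputed with an assumed Python/scipy sketch (scipy's kstest is used here in place of the column-by-column spreadsheet construction).

# Sketch, assuming Python with scipy: KS statistics for the Example 3.4A data.
import numpy as np
from scipy.stats import kstest

times = np.array([1.4, 2.3, 8.8, 14, 15.4, 27.4, 34.3, 76.8, 217.5, 390.8])
n = len(times)

d_known = kstest(times, 'expon', args=(0, 100)).statistic          # claimed MTTF = 100
d_est = kstest(times, 'expon', args=(0, times.mean())).statistic   # MTTF estimated from data

d_star_known = d_known * (np.sqrt(n) + 0.12 + 0.11 / np.sqrt(n))
d_star_est = (d_est - 0.2 / n) * (np.sqrt(n) + 0.26 + 0.5 / np.sqrt(n))
print(round(d_known, 4), round(d_star_known, 3))   # about 0.4098 and 1.36
print(round(d_est, 4), round(d_star_est, 3))       # about 0.347 and 1.17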


TABLE 3.2A
Percentage Points for Modified Kolmogorov D*-Statistics (Mean Unknown)

Exponential Distribution for F(t) with Mean Unknown
Significance level α:     15%     10%     5%      2.5%    1.0%
Percentage points:        0.926   0.995   1.094   1.184   1.298

Source: D’Agostino, R. B., and M. A. Stephens, eds. 1986. Goodness-of-Fit Techniques. New York: Marcel Dekker.

EXERCISE 3.3A Using the data in Example 3.4A, set up a spreadsheet to do a KS test for goodness of fit for an exponential distribution using the estimated MTTF. Plot the EDF and exponential model on the same graph and confirm the D value graphically.

One of the issues with tabulated values is that the actual significance level of the D-statistic is not determined. For example, D* in Example 3.4A was 1.172, which corresponds to an alpha between 5% and 2.5%. Interpolation may be applied as an approximation. However, simulation can provide the precise significance level. The approach is described in Ross (2006). Basically, we randomly generate (see Section 3.13) 10 times to failure from an exponential distribution with the MTTF estimate based on the original sample. Next, we estimate the MTTF of the randomly generated sample and calculate a D-statistic comparing the EDF (in this case, 1/10, 2/10, 3/10, …) of the simulated data to the exponential CDF with the MTTF estimated from the random sample. We repeat this process a large number of times, each random sample deriving from the exponential distribution with the MTTF of the original sample. The significance level is the ratio of the number of simulations in which the simulated D-statistic exceeds the D-statistic for the original data to the total number of simulations. It is helpful to have a program that runs a desired number of simulations and keeps track of the successful events (see Trindade and Meade 2001 or Savage 2003 for Excel add-ins that facilitate simulation studies). For this example, 10,000 simulations by the authors produced a p value equal to .028 for D = 0.347437.
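The simulation just described translates directly into code; the loop below is an assumed Python/numpy/scipy sketch (the seed and simulation count are ours), so its p value will only approximate the authors' .028.

# Sketch, assuming Python with numpy/scipy: simulated significance level for D.
import numpy as np
from scipy.stats import kstest

rng = np.random.default_rng(seed=2)      # arbitrary seed for repeatability
times = np.array([1.4, 2.3, 8.8, 14, 15.4, 27.4, 34.3, 76.8, 217.5, 390.8])
n = len(times)
d_obs = kstest(times, 'expon', args=(0, times.mean())).statistic

n_sim, exceed = 10_000, 0
for _ in range(n_sim):
    sample = rng.exponential(scale=times.mean(), size=n)
    d_sim = kstest(sample, 'expon', args=(0, sample.mean())).statistic
    exceed += d_sim > d_obs
print(exceed / n_sim)                    # should fall near the reported p value of .028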

Problems

3.1  How many units do we need to verify a 500,000-hour MTTF with 80% confidence, given that the test can run for 25,000 hours and 2 failures are allowed?

3.2  Suppose we want to confirm a 500-FIT rate at 90% confidence. We have 2500 components to test, and we are willing to allow up to 2 fails. How long should the test run? If we reduce the confidence level to 60%, how long must we test? What is the minimum test time (no fails allowed) at 60% confidence? At 75% confidence?

3.3  We want to check for a failure rate of 1600 FITs at 60% confidence. We have 1000 units and we can test for 2000 hours. How many fails can we allow?

3.4  Suppose a sampling plan designed to screen product lots requires a burn-in done on 300 pieces for 168 hours, with no more than 1 fail allowed for the lot to be accepted. Assuming a 60% confidence level, how many hours of burn-in


are needed to get the same level of protection if we switch to only accepting lots with zero fails? How many units must we burn-in if we shorten the time to 48 hours, keep 60% confidence, with zero fails allowed and the same level of failure rate protection?

3.5  For the product screen discussed in Problem 3.4, what is the danger of shortening the burn-in time, even if larger sample sizes are used? (Hint: Remember the assumptions underlying the exponential model, such as the lack of memory property.) Does 100% burn-in of a lot make sense if the exponential distribution really applies to all the units in the lot? What is the point of burn-in applied to a small sample from such a lot?

3.6  Assume a population is known to follow the life distribution model F(t) = 1 − e^(−λt²)



with λ unknown. An experiment (censored, Type I) is run, and 10 units fail out of 25 on test during the 1000-hour test time. Describe how you could estimate λ and construct confidence bounds for λ using the methods and tables given in Chapter 3. (Hint: Consider the new random variable given by X = t², where t is a random failure time from the given population.)

3.7  Simulate the failure time of 50 units randomly selected from an exponential population with MTTF = 10,000 hours. Estimate λ and the MTTF. Next, “censor” the simulated failure times at 2000 hours to represent the end of a censored Type I test and reestimate λ and the MTTF.

3.8  The PDF formula for the chi-square distribution with k degrees of freedom is given by

\[
f(x;k) = \begin{cases} \dfrac{1}{2^{k/2}\,\Gamma(k/2)}\, x^{(k/2)-1} e^{-x/2} & \text{for } x > 0, \\ 0 & \text{for } x \le 0, \end{cases}
\]

where Γ denotes the gamma function defined in Exercise 4.6. When n is an integer, Γ(n) = (n − 1)!. Use this to derive the −ln α/(nT) portion of the formula given in Section 3.11,

\[
\lambda_{(1-\alpha)} = \frac{\chi^{2}_{2;(1-\alpha)}}{2nT} = \frac{-\ln \alpha}{nT}
\]

for upper bounds for λ when there are zero fails out of n units on test for T hours (assuming the exponential lifetime model applies). Also, derive this formula using the probabilistic argument that λ(1−α) should be a value of λ so large, but no larger, that there is only an α chance of not seeing a single failure out of n exponential units on test for T hours.


4 Weibull Distribution In Chapter 3, we saw simple yet powerful methods for analyzing exponential data and planning life-test experiments. Questions about sample size selection, test duration, and confidence bounds could all be answered using a few spreadsheet functions. However, these methods apply only under the constant failure rate assumption or the equivalent “lack of memory” property. As long as this assumption is nearly valid over the range of failure times we are concerned with, we can use the methods given. In contrast, what do we do when the failure rate is clearly decreasing (typical of early failure mechanisms) or increasing (typical of later life wear-out mechanisms)? This problem was tackled by Weibull (1951). He derived the generalization of the exponential distribution that now bears his name. Since that time, the Weibull distribution has proven to be a successful model for many product failure mechanisms because it is a flexible distribution with a wide variety of possible failure rate curve shapes. In addition, the Weibull distribution can be derived as a so-called extreme value distribution, which suggests its theoretical applicability when failure is due to a “weakest link” of many possible sites where failure can occur. First we will derive the Weibull as an extension of the exponential. Then we will discuss the extreme value theory. We shall learn that not only does the Weibull appear to “work” in many practical applications, but there is also an explanation to tell us why it applies and in what areas it is likely to be most successful.

4.1  Empirical Derivation of the Weibull Distribution

The goal is to find a cumulative distribution function (CDF) that has a wide variety of failure rate shapes, with the constant h(t) = λ as just one possibility. Allowing any polynomial form of the type h(t) = at^b for a failure rate function achieves this objective. In order to derive F(t), it is easier to start with the cumulative hazard function H(t). Because h(t) = dH(t)/dt, setting

\[
H(t) = (\lambda t)^{\beta}
\]

results in

\[
h(t) = \beta \lambda (\lambda t)^{\beta - 1}
\]

This form gives us the exponential constant failure rate when β = 1 and a polynomial failure rate for other values of β.


Now, we use the basic identity relating F(t) and H(t):

\[
F(t) = 1 - e^{-H(t)} = 1 - e^{-(\lambda t)^{\beta}}
\]

We obtain the equation for the Weibull CDF by making a substitution of α = 1/λ in the above equation and write

\[
F(t) = 1 - e^{-(t/\alpha)^{\beta}}
\]

The parameter α (alpha) is a scale parameter that is often called the characteristic life. The parameter β (beta) is known as the shape parameter. Both α and β must be greater than zero, and the distribution is a life distribution defined only for positive times 0 ≤ t < ∞. The probability density function (PDF) f(t), failure rate h(t), and the average failure rate (AFR) function for the Weibull are given by the following:







\[
f(t) = \frac{\beta}{t}\left(\frac{t}{\alpha}\right)^{\beta} e^{-(t/\alpha)^{\beta}}
\]

\[
h(t) = \frac{\beta}{\alpha}\left(\frac{t}{\alpha}\right)^{\beta-1} = \frac{\beta}{t}\left(\frac{t}{\alpha}\right)^{\beta}
\]

\[
\mathrm{AFR}(t_1, t_2) = \frac{\left(\dfrac{t_2}{\alpha}\right)^{\beta} - \left(\dfrac{t_1}{\alpha}\right)^{\beta}}{t_2 - t_1}, \qquad
\mathrm{AFR}(T) = \frac{1}{T}\left(\frac{T}{\alpha}\right)^{\beta} = \frac{1}{\alpha}\left(\frac{T}{\alpha}\right)^{\beta-1}
\]

There is, unfortunately, no consistent convention used throughout the literature when naming the Weibull parameters. Previous editions of this text used m for the shape parameter and c for the characteristic life. This edition uses β (for shape) and α (for characteristic life), consistent with Nelson (1982, 1990) and JMP software. Meeker and Escobar (1998) also used β for the shape parameter but η (eta) for the characteristic life. To make matters even more confusing, both Excel and OpenOffice (and a few other programs) reverse the symbols by calling the shape parameter α and the characteristic life parameter β. Minitab avoids taking sides by not giving the parameters names other than scale (for characteristic life) and shape. Other textbooks use λ or p to name the shape parameter. Finally, a form of the Weibull is occasionally encountered where the scale parameter θ = α^β is used, resulting in a Weibull CDF that is written as follows:

\[
F(t) = 1 - e^{-t^{\beta}/\theta}
\]

Because of this confusion of terminology and the meanings of the parameters, the reader should be careful about the definitions used when reading Weibull literature or using Weibull software routines.


EXAMPLE 4.1 WEIBULL PROPERTIES A population of capacitors is known to fail according to a Weibull distribution with characteristic life α = 20,000 power-on hours. Evaluate the probability that a new capacitor will fail by 100, 1,000, 20,000, and 30,000 hours, for the cases where the shape parameter β equals 0.5, 1.0, and 2.0. Also calculate the failure rates at the specified times for these three shape parameters.

Solution Table 4.1 gives the various values requested. Several results are worth noting. In particular, observe that when the shape value is 0.5, the CDF at 100 hours is much higher than when β = 1 or β = 2. Also, the failure rate values for β = 0.5 are highest at the early time and decrease with each later time. Exactly the opposite is true for the failure rate values when β = 2.

Table 4.1 also shows, as the Weibull derivation given in this section indicates, that the failure rate for β = 1 is a constant. Thus, the Weibull reduces to an exponential with failure rate λ = 0.00005 = 1/α. Finally, note that the percent fail values at the characteristic lifetime value of 20,000 hours were uniformly 63.2% for all the choices of β. All these observations, based on Table 4.1, illustrate general points about the Weibull distribution that will be discussed in Section 4.2.

The Weibull CDF equation has four quantities that may be known, assumed, or estimated from data. These include the cumulative fraction failed F(t), the time t, the shape parameter β, and the characteristic life parameter α. If any three of these are known, the fourth can be calculated by one of the following equations:

\[
F(t) = 1 - e^{-(t/\alpha)^{\beta}}
\]

\[
\beta = \frac{\ln[-\ln(1 - F(t))]}{\ln(t/\alpha)}
\]

\[
t = \alpha\,[-\ln(1 - F(t))]^{1/\beta}
\]

\[
\alpha = \frac{t}{[-\ln(1 - F(t))]^{1/\beta}}
\]

TABLE 4.1
Solution to Example 4.1 (α = 20,000)

                 CDF in Percentage                    Failure Rate in %/K
Time        β = 0.5   β = 1.0   β = 2.0        β = 0.5   β = 1.0   β = 2.0
100         6.8       0.5       0.002          35.4      5.0       0.05
1,000       20.0      4.9       0.2            11.2      5.0       0.5
20,000      63.2      63.2      63.2           2.5       5.0       10.0
30,000      70.6      77.7      89.5           2.0       5.0       15.0


4.1.1  Weibull Spreadsheet Calculations

Assume that the population distribution is Weibull for the following five exercises. For exercises involving the Weibull CDF, one can use Excel or OpenOffice. The expression WEIBULL(t,β,α,1), with commas replaced by semicolons for OpenOffice, returns the Weibull CDF with shape β and characteristic life α, evaluated at time t. Change the last argument from 1 to 0 and one gets the Weibull PDF evaluated at time t. In addition, JMP and Minitab have the capability to evaluate the inverse Weibull function (the third of the four equations above). JMP’s formula editor calls this inverse function the Weibull Quantile and Minitab includes it under Probability Distributions in the Calc platform.

EXERCISE 4.1
Find the characteristic life necessary for 10% failures by 168 h, given a shape parameter of 2.0.

EXERCISE 4.2
Find the expected cumulative percent fallout at 1,000 hours, given a characteristic life of 1,000,000 hours and a shape parameter of 0.5.

EXERCISE 4.3
Find the time to achieve 20% failures, given a characteristic life of 50,000 hours and a shape parameter of 1.0.

EXERCISE 4.4
Find the shape parameter necessary for 5% failures in the first 2,000 hours, given a characteristic life of 30,000 hours.

EXERCISE 4.5
Find the characteristic life necessary to have an AFR of 100 FITs over the first 40,000 hours, given a shape parameter of 3.0.
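The four relationships among F(t), t, β, and α listed above can also be evaluated outside a spreadsheet; the helper functions below are an assumed Python sketch (the names are ours, not spreadsheet or JMP functions), shown here with Exercise 4.1-style inputs.

# Sketch, assuming Python: the four Weibull relationships, each solved for one unknown.
import math

def weibull_cdf(t, beta, alpha):
    return 1 - math.exp(-((t / alpha) ** beta))

def time_for(F, beta, alpha):
    return alpha * (-math.log(1 - F)) ** (1 / beta)

def alpha_for(t, beta, F):
    return t / (-math.log(1 - F)) ** (1 / beta)

def beta_for(t, alpha, F):
    return math.log(-math.log(1 - F)) / math.log(t / alpha)

# Example: characteristic life giving 10% failures by 168 hours with beta = 2.0
print(round(alpha_for(t=168, beta=2.0, F=0.10), 1))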

4.2  Properties of the Weibull Distribution

The strength of the Weibull lies in its flexible shape as a model for many different kinds of data. The shape parameter β plays the major role in determining how the Weibull will look. For 0 < β < 1, the PDF approaches infinity as time approaches zero and is always decreasing rapidly toward zero as time increases. The failure rate behaves the same way, making this type of Weibull a useful model for an early failure mechanism typical of the front end of the bathtub curve. When β = 1, the Weibull reduces to a standard exponential with constant failure rate λ = 1/α. For β > 1, the PDF starts at zero and increases to a peak at α[1 − (1/β)]^(1/β), after which it decreases toward zero as time increases. The shape is skewed to the right. The failure rate also starts at zero but then increases monotonically throughout life. The rate of increase depends on the size of β. For example, if β is 2, the failure rate increases linearly (the distribution is also known as the Rayleigh distribution). When β is 3, the failure rate has a


quadratic rate of increase, and so on. This type of Weibull is a useful model for wear-out failure mechanisms typical of the back end of the bathtub curve. Figure 4.1 shows several examples of Weibull CDFs, and Figure 4.2 illustrates the Weibull PDFs. Figure 4.3 is a graph of a variety of Weibull failure rate (hazard) curves. Table 4.2 summarizes the way the Weibull varies according to the value of its shape parameter β.

FIGURE 4.1 Weibull CDF. [Curves for β = 0.5, 1, 2, 4, and 10; time axis in multiples of the characteristic life α.]

FIGURE 4.2 Weibull PDF. [Curves for β = 0.5, 1, 2, 4, and 10; time axis in multiples of the characteristic life α.]


FIGURE 4.3 Weibull failure rate (hazard rate). [Curves for β = 0.5, 1, 2, 4, and 10; time axis in multiples of the characteristic life α.]

TABLE 4.2
Weibull Distribution Properties

Shape Parameter β | PDF | Failure Rate h(t)
0 < β < 1 | Exponentially decreasing from infinity | Same
β = 1 | Exponentially decreasing from 1/α | Constant
β > 1 | Rises to peak and then decreases | Increasing
β = 2 | Rayleigh distribution | Linearly increasing
β = 3 to 4 | Has “normal” bell-shaped appearance | Rapidly increasing
β ≥ 10 | Has shape very similar to Type I extreme value distribution | Very rapidly increasing

The second parameter of the Weibull, α, is a scale parameter that fixes one point of the CDF, the 63.2 percentile or characteristic life point. If we substitute α for time in the CDF, we obtain

\[
F(\alpha) = 1 - e^{-(\alpha/\alpha)^{\beta}} = 1 - e^{-1} = 0.632\ldots
\]

In other words, roughly 63.2% of the population fails by the characteristic life point, independent of the value of the shape parameter β. The median, or T50 point, for the Weibull is found by letting F(T50) = 0.5 to yield

\[
T_{50} = \alpha(\ln 2)^{1/\beta}
\]

The characteristic life can be obtained from the median using the formula

\[
\alpha = T_{50}/(\ln 2)^{1/\beta}
\]


EXERCISE 4.6 A manufacturer produces parts that have a median width of 2.0 cm. The population distribution is Weibull with a shape parameter of 2.0. If the specification for acceptable parts is 1.0 to 3.0 cm, what fraction of the parts produced are rejectable?

To give formulas for the mean and the variance of the Weibull distribution, it is necessary to introduce a mathematical function known as the gamma function. This function is defined by

\[
\Gamma(x) = \int_{0}^{\infty} y^{\,x-1} e^{-y}\, dy
\]

and is tabulated in many places; for example, see Abramowitz and Stegun (1964). In particular, when x is an integer, Γ(x) = (x − 1)!. When x = 0.5, Γ(0.5) = √π. In general, for x > 1, Γ(x) = (x − 1)Γ(x − 1). Excel and OpenOffice can evaluate Γ(x) with the expression EXP(GAMMALN(x)), and JMP has the gamma function included within the function editor. The mean of the Weibull is αΓ(1 + 1/β) and the variance is

\[
\alpha^{2}\,\Gamma(1 + 2/\beta) - [\alpha\,\Gamma(1 + 1/\beta)]^{2}
\]

For example, for β = 0.5, 1, and 2, the means are 2α, α, and α√π/2, respectively. Note that this mean, or MTTF, no longer has any direct relationship to the failure rate (unless β = 1). So the Weibull MTTF is not that useful or meaningful a number as compared to a graph of the failure rate or the AFR calculated over an interval of interest.

EXERCISE 4.7
A Weibull distribution has β = 0.5 and α = 1,000 hours. Determine the mean, median, and standard deviation. Sketch the PDF curve and indicate the positions of the mean and median.

The Weibull also has a closure or reproductive property, similar to the exponential. If a system is composed of n parts, each having an independent Weibull distribution with the same shape parameter but not necessarily the same characteristic life, and the system fails when the first component fails (a series system), then the time to the first system failure also follows a Weibull distribution. If the characteristic life parameters are α 1 , α 2 , ... , α n and the shape parameter is β , then the system failure distribution is also Weibull with shape parameter β. The system characteristic life is given by

\[
\alpha_s = \left( \sum_{i=1}^{n} \frac{1}{\alpha_i^{\beta}} \right)^{-1/\beta}
\]

EXAMPLE 4.2  WEIBULL CLOSURE PROPERTY A car manufacturer uses five different hoses as a part of the engine cooling system for one of its models. The hose manufacturer specifies that each hose has a lifetime modeled adequately by a Weibull distribution with shape parameter 1.8. The five hoses have characteristic lives, in months of average car use, of 95, 110, 130, 130, and 150. What is the life distribution for time to first car hose failure? What is the MTTF? What is the median or T50? How likely is it that no hose fails in the first year of car life? By 4 years?


Solution

Applying the closure property relationship, we have that the hose system (first) failure distribution is a Weibull with shape parameter β = 1.8 and characteristic life

\[
\alpha_s = \left( \frac{1}{95^{1.8}} + \frac{1}{110^{1.8}} + \frac{1}{130^{1.8}} + \frac{1}{130^{1.8}} + \frac{1}{150^{1.8}} \right)^{-1/1.8} = 48.6
\]

The MTTF is α_s Γ(1 + 1/1.8) = 43.2 months. The T50 is α_s (ln 2)^(1/β) = 39.6 months. The reliability at 12 months, or [1 − F(12)], is 92%. The reliability at 48 months is approximately 100% minus the characteristic life percentile (63.2%), or about 37%.

EXERCISE 4.8
A system consists of four identical, independent components. If one component fails, the system fails. The components’ failure times have a Weibull distribution with characteristic life α = 10,000 hours and shape parameter β = 0.5. Determine the system failure distribution. Find the system MTTF and median life. How likely is it that no system failures occur in the first 100 hours?
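The Example 4.2 numbers are easy to verify numerically; the following assumed Python sketch (not from the text) reproduces the system characteristic life, MTTF, median, and 12-month reliability.

# Sketch, assuming Python: check of the Example 4.2 closure-property results.
import math

beta = 1.8
alphas = [95, 110, 130, 130, 150]
alpha_s = sum(1 / a ** beta for a in alphas) ** (-1 / beta)

mttf = alpha_s * math.gamma(1 + 1 / beta)
t50 = alpha_s * math.log(2) ** (1 / beta)
r12 = math.exp(-((12 / alpha_s) ** beta))
print(round(alpha_s, 1), round(mttf, 1), round(t50, 1), round(r12, 2))
# roughly 48.6, 43.2, 39.6, and 0.92, matching the example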

The parameter definitions and key Weibull formulas given in this section are summarized in Table 4.3.

TABLE 4.3
Weibull Formulas Summary

Name | Value or Definition
CDF F(t) | F(t) = 1 − e^−(t/α)^β
Reliability R(t) | R(t) = e^−(t/α)^β
PDF f(t) | f(t) = (β/t)(t/α)^β e^−(t/α)^β
Characteristic life α | F(α) ≅ 0.632
Shape parameter β | Key parameter that determines shape of Weibull
Failure rate h(t) | h(t) = (β/α)(t/α)^(β−1) = (β/t)(t/α)^β
AFR (t1, t2) | AFR(t1, t2) = [(t2/α)^β − (t1/α)^β]/(t2 − t1)
MTTF or mean E(t) | αΓ(1 + 1/β)
Variance V(t) | α²Γ(1 + 2/β) − [αΓ(1 + 1/β)]²
Median = T50 | T50 = α(ln 2)^(1/β)
Mode | α(1 − 1/β)^(1/β)
System characteristic life when components are independent Weibull with parameters (β, αi) | αs = (Σ from i = 1 to n of 1/αi^β)^(−1/β)


4.3  Extreme Value Distribution Relationship

In general, if we are interested in the minimum of a large number of similar independent random variables, as, for example, the first time of failure reached by many competing similar defect sites located within a material, then the resulting life distribution can converge to only one of three types. These types, known as smallest extreme value distributions, depend on whether we define the random variables on the entire x-axis or just the positive or negative halves of the axis. The extreme value distributions were categorized by Gumbel (1954), and they are

\[
F(x) = 1 - e^{-e^{x}}, \qquad -\infty < x < \infty \quad \text{(Gumbel Type I)}
\]

\[
F(x) = 1 - e^{-(-x)^{-a}}, \qquad x \le 0
\]

\[
F(x) = 1 - e^{-x^{a}}, \qquad x \ge 0 \quad \text{(Weibull)}
\]

In the above formulas, the random variables have been suitably shifted by location parameters and divided by scale parameters so as to appear “standardized.” Of the three possible types, the only one that is a life distribution (i.e., defined only for nonnegative values) turns out to be the Weibull. Because failure can often be modeled as a weakest link of many competing failure processes, the wide applicability of the Weibull is not surprising. The justification for using the Weibull, based on extreme value theory, is an important and useful one. Such reasoning can also be abused and can lead to vigorously defended misapplications of the Weibull. For this reason, the strengths and weaknesses of this derivation should be carefully noted. If there are many identical and independent competing processes leading to failure and the first to reach a critical stage determines the failure time, then—provided “many” is large enough—we can derive a Weibull distribution. In a typical modeling application, we might suspect or hope all the italicized assumptions apply, but we are not likely to be certain. How do we pick a life distribution model from a practical viewpoint? The approach we recommend is as follows:



1. Use a life distribution model primarily because it works, that is, it fits the data well and leads to reasonable projections when extrapolating beyond the range of the data.
2. Look for a new model when the one previously used no longer “works.”
3. Select models by researching which models have been used in the literature successfully for similar failure mechanisms, or by using theoretical arguments based on physical models of failure. Examples of theoretical arguments are
   a. Extreme value theory
   b. The multiplicative degradation model theory of Chapter 5
   c. The “lack of memory” exponential model of Chapter 3

Another situation where theoretical derivations prove very useful is when several distributions all seem to work well with the available data but give significantly different results when projected into critical tail regions that are beyond the range of the data.


For example, say we fit two different life distribution models to a sample of 25 failure times and observe that both models appear to match the sample data reasonably well. With a sample size of 25, we cannot directly observe distributional percentiles below 4%, but we may be very interested in the CDF and failure rate behavior at a small percentile such as 0.1% or even 0.01%. This left-tail region could be “critical,” and the two competing models might have failure rate projections (extrapolations) that differ by orders of magnitude for times of interest located in this tail region. When this situation happens, pick the model that has a theoretical derivation most closely connected with the mechanism causing the failures. There is another very interesting mathematical relationship between the Weibull distribution and the Type I extreme value distribution. It turns out that the natural logarithms of a population of Weibull failure times form a population following the Type I distribution. In other words, the natural logarithm of a Weibull random variable is a random variable that has the Type I extreme value CDF. This relationship, as we shall see in Chapter 5, is exactly the same as exists between the lognormal life distribution and the normal distribution. So, if the Weibull had not been named in honor of its chief advocate, it probably would have been called the log-extreme value distribution. The exact statement of this relationship is as follows: let tf be a Weibull random variable with CDF

\[
F(t) = 1 - e^{-(t/\alpha)^{\beta}}
\]

Then the random variable X = ln tf has the Type I extreme value CDF given by

\[
F(x) = 1 - e^{-e^{(x-\lambda)/\delta}}
\]

with λ = ln α and δ = 1/β. A reliability analyst who has computer programs or graph paper designed to handle Type I extreme value data can also analyze Weibull data after first transforming it into extreme value data by taking natural logarithms. Later on, the scale parameter δ can be used to estimate the Weibull shape parameter via β = 1/δ. The location parameter λ is transformed into the Weibull characteristic life scale parameter via α = e^λ.

4.4  Areas of Application After the introduction of the Weibull distribution, its use spread across a wide variety of applications, from vacuum tubes and capacitors to ball bearings and relays and material strengths. The primary justification for its use has always been its flexibility to match a wide range of phenomena. There are few, if any, observed failure rates that cannot be accurately described over a significant range of time by a polynomial or Weibull hazard function. Some particular applications, such as modeling capacitor dielectric breakdown, fit nicely into the “worst link” (or first of many flaws to produce a failure) extreme value theory. Dielectric materials contain many flaws, all “competing” to be the eventual catastrophic failure site. In many cases, the failures occur mostly early in life, and a Weibull with a shape parameter less than 1 works best. On the other hand, there is less reason to expect a Weibull to apply when failure is due to a chemical reaction or a degradation process such as corrosion, migration, or diffusion


(although even here the “many competing sites” argument might still possibly apply). It is in precisely such applications, typical of many semiconductor failure mechanisms, that the lognormal distribution (Chapter 5) has replaced the Weibull as the most popular distribution. One particular form of Weibull deserves special mention. When β = 2, as noted in Table 4.2, the distribution is called the Rayleigh. The failure rate increases linearly with h(t) = 2t/α² and the CDF is given by

\[
F(t) = 1 - e^{-(t/\alpha)^{2}}
\]



There is an interesting measurement error problem that also leads to this same CDF. Assume you are measuring, or trying to locate, a particular point on the plane. A reasonable model that is often used measures independent x and y coordinates. Each measurement has a random amount of error, modeled as usual by the normal distribution (Chapter 5). Assume each error distribution has zero mean and the same standard deviation, σ. If the error in the x direction is the random variable X and the error in the y direction is Y, then the total radial error (or distance from the correct location) is

\[
R = \sqrt{X^{2} + Y^{2}}
\]

Using standard calculus methods, the CDF of R can be derived. It turns out to be

\[
F(r) = 1 - e^{-r^{2}/(2\sigma^{2})}
\]

which is the Rayleigh distribution with α = √2 σ.

EXAMPLE 4.3  RAYLEIGH RADIAL ERROR

An appliance manufacturer wants to purchase a robot arm to automate a particular assembly operation. The quality organization has been asked to evaluate the reliability of an arm under consideration. A key point in the evaluation is whether the arm can repeatedly go to specified points in its operating range, within a tolerable margin of positioning error. The literature on the arm says that it will repeatedly arrive at programmed points with an accuracy of ±0.3 cm in either the x-coordinate or the y-coordinate direction. Tests have determined that the operation will succeed as long as the arm arrives no further than 0.4 cm from the designated point. If an error rate of less than 1 in a 1000 is required, will the arm under consideration be adequate?

Solution This example is instructive, not only of the Rayleigh distribution but also of the kind of detective work a statistician or reliability analyst must often carry out. What does an accuracy of ±0.3 cm really mean? Often a phone call to the supplier will not produce an immediately satisfactory answer. In the meantime, an analyst can make an evaluation based on making typical assumptions, often involving knowledge of the normal distribution, as discussed in Chapter 5. Assume that ±0.3 cm refers to plus or minus three standard deviations (or sigmas) of the typical normal error distribution. Sigma is then 0.1 cm. If we also assume that the placement errors in each coordinate are independent with an average value of zero, the point the robot arm arrives at will have a random distance from the objective point with a CDF given by the Rayleigh distribution

\[
F(r) = 1 - e^{-r^{2}/(2(0.1)^{2})} = 1 - e^{-50 r^{2}}
\]


Substituting 0.4 for r yields F(0.4) = 0.9997. This result means the arm will be more than 0.4 cm off only about 3 times in every 10,000 operations, which meets the objective of less than 1 in 1,000. EXERCISE 4.9 An expert dart thrower claims that he can throw a dart repeatedly at a target and end up within 0.5 cm of the origin of the target in any direction. The bull’s eye is centered at the origin and has a diameter of 0.7 cm. If he throws a dart at the target 50 times, what is his expected number of hits within the bull’s eye? What assumptions are you making?

4.5 Weibull Parameter Estimation: Maximum Likelihood Estimation Method The recommended method for estimating β and α from data (either complete or censored, or grouped samples) is called the method of maximum likelihood estimation (MLE). It is a standard, well-known technique, described fully in most statistics textbooks (for example, see Wilks 1962). Its use for censored reliability data is described in detail in Nelson (1982) and also in Mann, Schafer, and Singpurwalla (1974) and Meeker and Escobar (1998). However, the maximum likelihood estimates (MLEs) for censored or multicensored data are generally difficult to compute, frequently requiring the solution of complicated simultaneous nonlinear equations. Until the mid-1990s, reliability software with built-in MLE routines for each of the common life distributions was scarce, usually expensive, and often difficult to use. Now, JMP and Minitab (and several other statistical software packages such as S-Plus) can analyze any kind of multicensored data and compute MLEs for all the popular life distribution models. In addition, with the aid of the built-in Solver routine, Excel can be used to calculate MLEs and parameter confidence bounds, as we will show in Appendix 4A. At this point, it is natural to inquire why a complicated method requiring powerful computer routines is recommended when simple, intuitive graphical techniques, combined with a curve-fitting routine that eliminates subjective judgment, can be used. In addition, the graphical approach offers an immediate visual test of whether the Weibull distribution fits the data or not (based on whether the data points line up in an approximate line on special Weibull graph paper or computer Weibull plots). The reason for the MLE recommendation has to do with the concept of accuracy, in a statistical sense. In a real-life problem, when we estimate parameters from data, the results can differ widely depending on the estimation method used. Important business decisions may depend on which estimate we choose. Obviously, it is important to have objective criteria that tell us which method is best in a particular situation. Accuracy, in a statistical sense, starts with a definition of the desirable properties an estimation method may have and continues with an investigation into which of these “good” properties the methods available actually have. Statistical theory describes how well various methods compare to each other in the long run—over many, many applications. We have to accept the logic that it makes sense to use the best long-run method on any single set of data we want to analyze.


The most desirable attributes defined for estimation methods are the following:



1. Lack of bias: The expected value of the estimate equals the true parameter (on the average, we are centered "on target").
2. Minimum variance: A minimum variance estimator has less variability on average than any other estimator, and if also unbiased, it is likely to be closer to the true value than another estimator.
3. Sufficiency: The estimate makes use of all the statistical information available in the data.
4. Consistency: The estimate tends to get closer to the true value with larger size samples ("infinite" samples yield perfect estimates).

In addition, we want our estimation technique to provide a methodology for finding confidence intervals and carrying out tests of hypotheses. In general, no known method provides all of the attributes mentioned. Indeed, it may be difficult to find any method that is unbiased and also has a minimum variance when dealing with life distributions and censored data. However, the MLE method can be shown to possess all of the above properties as sample sizes (and numbers of failures) become large enough. This property, called asymptotic behavior, assures us that, for reasonable amounts of data, no other estimation technique is "better." Asymptotic theory does not tell us "how large" is "large," but practical experience and simulation experiments indicate that more than 20 failures are "large," and, typically, if there are over 10 failures, the MLEs are accurate. For smaller amounts of data, the unbiased minimum variance property cannot be claimed, but better techniques are hard to come by.

A loose but useful description of the MLE technique is as follows: The "probability" of the sample is written by multiplying the PDF f(t) evaluated at each data point. This product, containing the data points and the unknown parameters, is called the likelihood function. By finding parameter values that maximize this expression we make the set of data observed "more likely." In other words, we choose parameter values that are most consistent with our data by maximizing the likelihood of the sample.

The MLE technique is therefore equivalent to maximizing an equation of several variables. In general, the standard calculus approach of taking partial derivatives with respect to each of the unknown parameters and setting them equal to zero will yield equations that have MLEs as solutions. Although it may be possible to maximize the likelihood equation directly, in most cases, the calculations are simplified by first applying natural logarithms to the likelihood equation, and then taking partial derivatives of the negative of the log likelihood and solving for a minimum. The same parameter values that minimize the negative log likelihood will, of course, maximize the likelihood. However, for censored or grouped data, these equations are nonlinear and complicated to set up and solve. Consequently, appropriate software is needed.

When the life distribution is the exponential, the MLE equations are easy to derive and solve, even for censored data. This procedure is shown in Example 4.4, which illustrates how the MLE method works.

EXAMPLE 4.4  MLE FOR THE EXPONENTIAL
Show that the estimate of the exponential parameter λ given in Chapter 3 (i.e., the number of failures divided by the total unit test hours) is the MLE estimate for complete or censored Type I (time censored) or censored Type II (rth fail censored) data.


Solution
The likelihood equation (LIK) consists of two parts: (1) the product of the r PDFs for each failure time, and (2) the survival probability at the end of test raised to the power of the number surviving, n − r:

LIK = k \prod_{i=1}^{r} f(t_i)\,[1 - F(T)]^{n-r}

For the exponential distribution, we get

LIK = k\,\lambda^{r}\, e^{-\lambda \sum_{i=1}^{r} t_i}\,\left(e^{-\lambda T}\right)^{n-r}

where k is a constant independent of λ and not important for the maximizing problem. The last term in LIK is the probability of n – r sample units surviving past the time T. If T is fixed in advance, we have Type I censoring. If T is the time of the rth fail, we have Type II censoring. If r = n, the sample is complete or uncensored.

If we let L denote the negative log likelihood (without any constant term), then

L = \lambda \sum_{i=1}^{r} t_i + (n - r)\lambda T - r \ln \lambda

To find the value of λ that minimizes L, we take the derivative with respect to λ and set it equal to 0:

\frac{dL}{d\lambda} = \sum_{i=1}^{r} t_i + (n - r)T - \frac{r}{\lambda} = 0

Solving for λ, we have the MLE given in Chapter 3:

\hat{\lambda} = \frac{r}{\sum_{i=1}^{r} t_i + (n - r)T}
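A minimal Python sketch (not from the text) of this closed-form estimate; the failure times, sample size, and end-of-test time below are hypothetical values chosen only to illustrate the formula.

fail_times = [105.0, 260.0, 310.0, 490.0, 815.0]   # hypothetical exact failure times (hours)
n = 20                                             # total units on test (assumed)
T = 1000.0                                         # fixed end-of-test time (Type I censoring)

r = len(fail_times)
total_unit_hours = sum(fail_times) + (n - r) * T   # failures plus survivor runtimes
lambda_hat = r / total_unit_hours                  # MLE: failures per unit test hour
print(lambda_hat, 1.0 / lambda_hat)                # lambda_hat and the MTTF estimate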

If the data are read out or grouped, the MLE, even in the simple exponential case, is not as easy to obtain. The likelihood equation is

LIK = k\,F(T_1)^{r_1} \prod_{i=2}^{m} \left[F(T_i) - F(T_{i-1})\right]^{r_i}\,[1 - F(T_m)]^{n-r}

where T_1, T_2, ..., T_m = the readout times; r_1, r_2, ..., r_m = the failures first observed at those m times; F = the CDF for the assumed life distribution; and r = \sum_{i=1}^{m} r_i = the total number of failures out of n on test.
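For readout data there is no closed-form solution, so the likelihood above has to be maximized numerically. A minimal Python sketch (not from the text), using hypothetical readout times and counts and an exponential model:

import numpy as np
from scipy.optimize import minimize_scalar

readout_times = np.array([24.0, 168.0, 500.0, 1000.0])   # T_1, ..., T_m (hypothetical)
new_fails = np.array([1, 3, 4, 2])                        # r_1, ..., r_m (hypothetical)
n = 50                                                    # units on test (assumed)

def neg_log_lik(lam):
    F = 1.0 - np.exp(-lam * readout_times)                # exponential CDF at each readout
    interval_probs = np.diff(np.concatenate(([0.0], F)))  # F(T_i) - F(T_{i-1})
    r = new_fails.sum()
    return -(np.sum(new_fails * np.log(interval_probs))
             + (n - r) * np.log(1.0 - F[-1]))             # survivors past T_m

res = minimize_scalar(neg_log_lik, bounds=(1e-8, 1e-1), method="bounded")
print(res.x)                                              # MLE of lambda for this grouped data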




The partial derivative equations that are obtained in the Weibull case may be found in Chace (1976). For multicensored data with l_i units removed from the test at time L_i, we include terms of the form [1 − F(L_i)]^{l_i} in the LIK equation. These terms reflect the fact that we know only that those l_i units have survived to time L_i.

We can also write the likelihood equation for truncated data. First we consider the case of left-truncated data. As described in Chapter 2, left truncation means that all the population values less than a fixed time T_L are missing, and we do not even have a count of how many of them should have been in our sample. For example, we may have components burned in for T_L h, but the supplier does not report the number of units that failed during burn-in. In other words, we sample only from the portion of the population that contains units that are known to have survived at least T_L hours. The PDF and CDF models for this truncated population can easily be written in terms of the complete population PDF and CDF, using conditional probabilities, as follows:

f_{\mathrm{lefttrunc}}(t) = \frac{f(t)}{1 - F(T_L)}, \quad T_L \le t \le \infty

F_{\mathrm{lefttrunc}}(t) = \frac{F(t) - F(T_L)}{1 - F(T_L)}, \quad T_L \le t \le \infty

The likelihood equation for left-truncated data is obtained by substituting f_lefttrunc and F_lefttrunc for f and F in the formula for LIK given earlier. Thus,

LIK = k \prod_{i=1}^{r} f_{\mathrm{lefttrunc}}(t_i)\,[1 - F_{\mathrm{lefttrunc}}(T)]^{n-r} = k\,[1 - F(T_L)]^{-n} \prod_{i=1}^{r} f(t_i)\,[1 - F(T)]^{n-r}

The result is the same as writing the LIK function as if there were no truncation and then multiplying by [1 − F(T_L)]^{−n}, where n is the total number in the sample, including all items that either failed or were eventually censored.

Now consider a sample drawn from a right-truncated population where all observations greater than T_R are missing and the number of them is not known. For example, a large, random sample of units includes an unknown number with a defect that will lead to failure. At the end of test at T_R hours, we do not know the number of defective units remaining in the sample, and consequently we have right truncation. The multiplier that converts nontruncated data to right-truncated data is [F(T_R)]^{−n}. If the sample is drawn from a population that has both ends cut off (i.e., both left and right truncated), then the multiplier is [F(T_R) − F(T_L)]^{−n}, where T_R is the right truncation time and T_L is the left truncation time.

Let us prescribe a general method for writing the LIK equation corresponding to a given set of data. Each observation that is an exact failure time contributes a term of the form f(t_i), where t_i is the time of failure. Each observation that indicates a unit failed somewhere between two times, say T_{i−1} and T_i, needs an expression of the type [F(T_i) − F(T_{i−1})]. Each observation for a unit still surviving at L_i hours, with no further information known about that unit after L_i hours, generates a statement of the form [1 − F(L_i)]. In the last situation, L_i is sometimes called a runtime. If the sample comes from a left-truncated population, each observation (whether an exact time failure, a readout time failure, or a runtime) contributes an additional multiple of the form [1 − F(T_L)]^{−1}.


If the sample comes from a right-truncated population, the additional multiple for each observation is [F(T_R)]^{−1}. If the sample comes from a population that is both left (at time T_L) truncated and right (at time T_R) truncated, the additional multiplier for each observation is [F(T_R) − F(T_L)]^{−1}. All of these terms, with a constant multiplier term that does not affect the MLE solutions and may be ignored, are multiplied together to derive the LIK equation. Take the negative natural logarithm of LIK to obtain L. The parameter values that minimize L are the MLEs.

To elicit all the information needed to write the LIK equation, we ask the following four questions:



1. What are the exact times of failure, if any?
2. What are the starting and ending times of all intervals that contain failures whose exact failure times are not known, and what are the corresponding numbers of failures for each interval?
3. What are the censoring or runtimes when unfailed units are no longer observed, and what are the associated numbers of censored units?
4. Is the sample taken from a left- or right-truncated (or both left and right) population?
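A minimal Python sketch (ours, not the book's) of how the answers to these four questions assemble into a negative log likelihood, here with a Weibull model assumed; the data values and function names are hypothetical.

import numpy as np

def weibull_cdf(t, alpha, beta):
    return 1.0 - np.exp(-(t / alpha) ** beta)

def weibull_pdf(t, alpha, beta):
    return (beta / alpha) * (t / alpha) ** (beta - 1.0) * np.exp(-(t / alpha) ** beta)

def neg_log_lik(alpha, beta, exact=(), intervals=(), runtimes=(),
                left_trunc=None, right_trunc=None, n_total=0):
    L = 0.0
    for t in exact:                        # question 1: exact failure times -> f(t_i)
        L -= np.log(weibull_pdf(t, alpha, beta))
    for t0, t1, count in intervals:        # question 2: interval failures -> F(T_i) - F(T_i-1)
        L -= count * np.log(weibull_cdf(t1, alpha, beta) - weibull_cdf(t0, alpha, beta))
    for t, count in runtimes:              # question 3: censored units -> 1 - F(L_i)
        L -= count * np.log(1.0 - weibull_cdf(t, alpha, beta))
    # question 4: truncation multipliers [1 - F(T_L)]^(-n) and [F(T_R)]^(-n)
    if left_trunc is not None:
        L += n_total * np.log(1.0 - weibull_cdf(left_trunc, alpha, beta))
    if right_trunc is not None:
        L += n_total * np.log(weibull_cdf(right_trunc, alpha, beta))
    return L

# Hypothetical mixed sample: 2 exact failures, 3 interval failures, 4 survivors at 800 hours.
print(neg_log_lik(1000.0, 1.2, exact=[150.0, 480.0],
                  intervals=[(500.0, 800.0, 3)], runtimes=[(800.0, 4)]))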

With the answers to these questions and an assumed model for F(t), we can set up the LIK equation and try to solve for the MLEs of the unknown parameters. In general, numerical algorithms are needed to solve the nonlinear MLE equations, and convergence to a correct solution depends on the starting values chosen by the analyst or the computer routine. Success is not always guaranteed.

A useful part of the large sample theory for MLE estimates is that they have an asymptotic normal distribution. The mean is the true parameter value and the standard deviation can be estimated from equations based on partial derivatives of the log likelihood equations. This theory will not be given here; it is described in detail by Nelson (1982), and the calculations are often part of software that estimates MLEs from reliability data. This result can be used to obtain approximate confidence intervals for the unknown parameters. Another (generally better) method for obtaining confidence intervals and for testing hypotheses about the assumed distribution and its parameters is based on the behavior of the negative log likelihood function at and around the MLE values. This method uses likelihood ratio statistics, which are described in detail in Sections 8.4 and 8.5. JMP calls confidence intervals calculated this way profile-likelihood intervals and uses this method in its reliability platforms.

One more use of MLEs deserves mention. As stated in Chapter 3, when doing a goodness-of-fit test to check a distribution assumption, the types of estimates to use for unknown parameters are MLEs.

EXAMPLE 4.5  WEIBULL MLE PARAMETER ESTIMATION
Capacitors, believed to have operating lifetimes modeled adequately by a Weibull distribution, were tested at high stress to obtain failure data. Fifty units were run and 25 failed by the end of the test. The failure times were continuously monitored, and the times of fail were, in hours to the nearest tenth, 0.7, 52.7, 129.4, 187.8, 264.4, 272.8, 304.2, 305.1, 309.8, 310.5, 404.8, 434.6, 434.9, 479.2, 525.3, 620.3, 782.8, 1122, 1200.8, 1224.1, 1322.7, 1945, 2419.5, 2894.5, and 2920.1. If the test ended at the last fail, use the MLE method to estimate the Weibull parameters. Next, assume only readout data were taken. The readouts were made at 24, 168, 200, 400, 600, 1000, 1500, 2000, 2500, and 3000 hours. The new failures observed at those readout times were 1, 2, 1, 6, 5, 2, 4, 1, 1, and 2. Again, find MLEs for β and α.


Solution
We will find Weibull MLEs using JMP and Minitab. MLEs obtained using Excel will be shown in Appendix 4.1A. All the data tables and worksheets used for this exercise are available from the "Downloads & Updates" tab on the web page for this book at http://www.crcpress.com/product/isbn/9781584884668.

Figure 4.4 shows how to create a JMP data table that can be used by the JMP Fit Parametric Survival platform to analyze right-censored, exact time of failure data. Set up a column (called "Time" in Figure 4.4) and place both the exact times of failure and the runtimes in that column. The "Censor" column has a zero in any row where the "Time" column has an exact time of failure. Put any nonzero number in the "Censor" column rows where the "Time" column represents a censoring or runtime. The "Frequency" column has the number of failed or censored items that correspond to the rows in "Time." These columns can be called by any names that the analyst chooses.

Figure 4.5 shows what the Fit Parametric Survival platform (under menu items Analyze, Survival, and Reliability) looks like after inputting the entries needed for a Weibull MLE analysis of the worksheet data. After clicking on Run Model, the output shown in Figure 4.6 appears. In release 7 of JMP, the Parameter Estimates shown in Figure 4.6 are the extreme value distribution parameters λ = 8.5327 (here called Intercept) and δ = 1.6112. Displaying the optional Alternate Parameterization gives the Weibull MLE parameters β̂ = 0.621 and α̂ = 5078.3. Note that we can also tell JMP to compute approximate 95% confidence bounds for λ and δ (using the profile-likelihood method).

FIGURE 4.4 JMP data table for exact times, censored data analysis.


FIGURE 4.5 Inputs for JMP Fit Parametric Survival analysis—exact times.

FIGURE 4.6 JMP Weibull MLEs for Example 4.5 exact times, censored data.

The minimized L value of 75.4708 is shown in the −LogLikelihood column (either Full or Reduced). Loosely speaking, the parameter values that minimize L for a particular model choice produce the best fitting model for the given data set. Any other parameter choices will produce an L at least as large as this MLE value of L, and the smaller L is, the better the model fits the data.
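For readers without JMP or Minitab, the same MLEs can be obtained by minimizing the negative log likelihood directly. A minimal Python sketch (not from the text) for the exact-time data of Example 4.5 follows; the starting values are our choices, and the L computed here is the "full" version that keeps the sum of log failure times, so it should match the Minitab convention discussed later in this example rather than JMP's.

import numpy as np
from scipy.optimize import minimize

fails = np.array([0.7, 52.7, 129.4, 187.8, 264.4, 272.8, 304.2, 305.1, 309.8, 310.5,
                  404.8, 434.6, 434.9, 479.2, 525.3, 620.3, 782.8, 1122, 1200.8,
                  1224.1, 1322.7, 1945, 2419.5, 2894.5, 2920.1])
n, r = 50, len(fails)
T = fails.max()                          # test ended at the last failure

def neg_log_lik(params):
    alpha, beta = np.exp(params)         # work on the log scale to keep alpha, beta > 0
    z = fails / alpha
    log_pdf = np.log(beta / alpha) + (beta - 1.0) * np.log(z) - z ** beta
    log_surv = -(T / alpha) ** beta      # ln[1 - F(T)] for each of the n - r survivors
    return -(log_pdf.sum() + (n - r) * log_surv)

res = minimize(neg_log_lik, x0=[np.log(1000.0), 0.0], method="Nelder-Mead")
alpha_hat, beta_hat = np.exp(res.x)
print(alpha_hat, beta_hat, res.fun)      # expect values roughly near 5078, 0.62, and 226.7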


Figure 4.7 shows how to set up a JMP data table that can be used by the JMP Fit Parametric Survival platform to analyze interval (readout) data. We need two columns (called "Start Time" and "Stop Time" in Figure 4.7). These contain the starting points and the stop points for every readout interval. Leave the beginning of the first interval (starting time = 0) blank. For censoring times, put the runtime in the "Start" column and leave the "Stop" column entry blank. The "Frequency" column has the number of failed or censored items that correspond to the row entries in the other two columns. No "Censor" column is needed. Again, these columns can have any names that the analyst chooses.

Figure 4.8 shows the inputs to the Fit Parametric Survival analysis routine for interval data. The results are shown in Figure 4.9, where we see that the parameter MLEs are α̂ = 5024.96 and β̂ = 0.643 and the corresponding minimized value of L is 91.414.

FIGURE 4.7 JMP data table for interval data from Example 4.5.

FIGURE 4.8 Inputs for JMP Fit Parametric Survival analysis—interval data.


FIGURE 4.9 JMP Weibull MLEs for Example 4.5 interval data.

FIGURE 4.10 JMP data table for exact times treated as interval data.

The data table created in Figure 4.7 can be generalized to handle exact times and multicensored mixtures of exact times, readout times, and runtimes. Figure 4.10 shows how the exact time data can also be analyzed by treating an exact time as a degenerate interval with the same start and stop times. The same inputs shown in Figure 4.8 for this new data table produce the same MLEs that were found by the exact time analysis previously shown in Figure 4.9.


FIGURE 4.11 Minitab analysis for exact times, right-censored data analysis.

We turn now to a Minitab analysis of the Example 4.5 data. The exact failure time data can be entered as shown in Figure 4.11. Note that the columns are similar to those constructed in Figure 4.4 for the JMP data table, except that the Minitab censoring code convention calls for a "1" when the time row corresponds to an exact failure time, with any other number indicating a runtime (JMP uses "0" for an exact failure time, with any other number indicating a runtime). In addition, we have to click on the Censor button shown in the Parametric Distribution Analysis Right Censoring screen to tell Minitab which column to use for the censoring codes. We also have to click on the Estimate button to change from the least squares estimation method default to the better MLE method. When these choices have been made, clicking OK gives you the same MLEs shown in Figure 4.6. The confidence bounds will be slightly different, as Minitab (in release 15) uses a different asymptotic method (generally considered less accurate than the intervals obtained from the profile-likelihood method). The Minitab output is as follows:

Distribution: Weibull
Parameter Estimates
                                          95.0% Normal CI
Parameter    Estimate   Standard Error     Lower       Upper
Shape        0.620651         0.113688   0.433439    0.888723
Scale         5078.30          1826.87    2509.03     10278.5

Log-Likelihood = −226.683

Note that Minitab produces a minimized L value of 226.683, which is 151.212 larger than the value reported by JMP. This result occurs because JMP does not include the sum of the natural logarithms of all the exact times of failure in L when it calculates the negative log likelihood sum. These log likelihood terms are a legitimate part of L, but they do not play a role in the MLE calculation and are ignored by JMP. JMP does this for all MLE calculations with exact failure times and not just for Weibull data. So, when there are exact times of failure, the JMP value for L will differ from Minitab by the sum of the natural logarithms of the exact times.

Interval data can be analyzed using the Minitab Parametric Distribution Analysis Arbitrary Censoring platform. The worksheet and input screen are shown in Figure 4.12.


FIGURE 4.12 Minitab analysis inputs for interval data.

Note that the worksheet setup is similar to that shown for JMP in Figure 4.7, except that any blank cells used in the JMP data table for either the start of the first interval or for runtimes are replaced by asterisks (*) in the Minitab worksheet. No censor column is needed, but we still have to choose the MLE method after clicking on the Estimate button. The Parametric Distribution Analysis Arbitrary Censoring platform will also accept exact time of failure data formatted similarly to the JMP data table shown in Figure 4.10, but the analyst must be careful to input asterisks (*) instead of blank values and make sure all columns are treated as numerical data. With properly formatted input, this platform is capable of handling any kind of multicensored data input. The Minitab output after selecting OK has the same MLEs and minimized negative log likelihood as shown by JMP in Figure 4.9.

In Appendix 4.1A we show how to obtain Weibull MLEs for both exact times to failure and interval data using a spreadsheet.

EXERCISE 4.10
A reliability analyst is asked to estimate Weibull parameters based on life test data for a sample of components he will test at high temperature. He has three identical ovens, each available for a maximum time of 1000 hours, and a random sample of 75 components to test. He puts 25 components in each oven and is able to record exact (to the nearest hour) times of fail only from oven 1. He reads out numbers of new failures from oven 2 at 24, 168, 400, 800, and 1000 hours. He has less control over oven 3 and only gets readouts of new failures at 100, 200, 400, 600, and 700 hours (after 700 hours, oven 3 is no longer available for his use).

Oven 1 records 23 failures at these times: 95, 132, 211, 223, 240, 280, 311, 395, 442, 494, 511, 523, 529, 556, 641, 728, 738, 819, 836, 837, 889, 890, and 987 hours. There were 2 unfailed units left at 1000 hours. Oven 2 records 0 fails at 24, 1 fail at 168, 6 new fails at 400, 12 new fails at 800, and 2 new fails at 1000 hours. There were 4 unfailed units left at 1000 hours. Oven 3 records 0 fails at 100, 0 fails at 200, 6 fails at 400, 3 new fails at 600, and 2 new fails at 700 hours. There were 14 unfailed units left at 700 hours.


Using JMP or Minitab, estimate Weibull MLEs for each oven's data separately and then for all the data combined. What are the minimized negative log likelihood (L) values for the three separate cells and the combined one-cell analysis?

EXAMPLE 4.6  WEIBULL MLE PARAMETER ESTIMATION: LEFT-TRUNCATED DATA
A company buys components from a vendor who is known to do a burn-in, prior to shipping the parts, that is equivalent to 5000 hours of normal field operation for the component. The vendor claims that the population of component failure times follows a Weibull life distribution model with a shape less than 1 (indicating a decreasing failure rate and making the burn-in an effective way to reduce early failures). The company keeps careful track of all the field failures for a sample of 101 of these components for 15,000 hours of normal field operation. There were 32 failures at the field duration times shown in Table 4.4, and 69 components were still operating at the end of the 15,000-hour observation period. Unfortunately, the vendor does not have any burn-in fallout information for you to use. Estimate Weibull parameters for the preburn-in vendor population of components using only the field data for these 101 components that are all known to have survived the burn-in.

Solution
These values represent left-truncated data with a truncation time of T = 5000 hours. We start by adding 5000 to each of the times shown in Table 4.4 and also to the censoring time of 15,000 hours. Now, we have a sample of 101 failure and runtimes, with time zero starting before the burn-in. The rule for writing likelihood functions says to first ignore truncation and then multiply this untruncated likelihood by [1 − F(5000)]^{−101}. However, this is equivalent (mathematically in the likelihood formula) to adding an imaginary right censoring time of 5000 hours, at which time a negative 101 "unfailed units" are "taken off test." We should be able to enter this fictitious censoring time into Minitab or JMP, along with the exact times of failure and the real censoring time, and then obtain MLEs. Unfortunately, earlier releases of these programs (Minitab up to release 15 and early releases of JMP 7) cannot handle negative frequencies. This capability was added in later releases of JMP 7 and is available in JMP 8. However, the iterative algorithm used by JMP uses up its iteration limit before converging for these data. In Appendix 4.2A, we describe how to use the Excel spreadsheet Genweibest.xls to solve this problem. The MLE result obtained for α is 40,432 hours, and the MLE for β is 0.745. Note that these parameters are for the population of components as manufactured prior to burn-in.

Readers may verify that analyzing the field data ignoring the burn-in completely leads to a Weibull shape estimate of nearly 1 (0.953), and this would give the impression that a burn-in is wasteful because the failure rate for this model is almost constant. An even more misleading shape estimate of 1.8 would be obtained if the data were analyzed by adding 5000 hours to the field event times but ignoring the fact that the samples came from a truncated population. The determination of likelihood profile confidence intervals for Weibull MLE parameters using a spreadsheet is described in Appendix 4.3A.

TABLE 4.4
32 Field Failure Times from 101 Burned-In Components

  454    2976    5822   10338
  620    3095    7913   11337
  974    3381    8660   11368
 1063    3475    8682   11904
 1265    4055    8999   12210
 1936    5592    9002   12776
 2351    5641    9712   12859
 2364    5765    9960   14036
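As a cross-check on the spreadsheet approach, here is a minimal Python sketch (not from the text) of the left-truncated likelihood for this example; it shifts all times by the 5000-hour burn-in and applies the [1 − F(5000)]^{−101} multiplier described above. The starting values are our choices.

import numpy as np
from scipy.optimize import minimize

field_fails = np.array([454, 620, 974, 1063, 1265, 1936, 2351, 2364,
                        2976, 3095, 3381, 3475, 4055, 5592, 5641, 5765,
                        5822, 7913, 8660, 8682, 8999, 9002, 9712, 9960,
                        10338, 11337, 11368, 11904, 12210, 12776, 12859, 14036],
                       dtype=float)
burn_in = 5000.0
n = 101
fails = field_fails + burn_in           # shift so time zero is before burn-in
T_cens = 15000.0 + burn_in              # 69 survivors censored at 20,000 total hours

def neg_log_lik(params):
    alpha, beta = np.exp(params)
    def log_surv(t):                    # ln[1 - F(t)] for the Weibull model
        return -(t / alpha) ** beta
    log_pdf = np.log(beta / alpha) + (beta - 1) * np.log(fails / alpha) + log_surv(fails)
    L = -(log_pdf.sum() + (n - len(fails)) * log_surv(T_cens))
    L += n * log_surv(burn_in)          # truncation multiplier [1 - F(5000)]^(-n)
    return L

res = minimize(neg_log_lik, x0=[np.log(20000.0), 0.0], method="Nelder-Mead")
print(np.exp(res.x), res.fun)           # expect roughly alpha near 40,000 and beta near 0.75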


4.6  Weibull Parameter Estimation: Linear Rectification

A second estimation method for Weibull parameters is based on a procedure called linear rectification. The idea is to put the Weibull CDF equation into a form that, with the proper substitution of variables, is linear. The equations are as follows:

F(t) = 1 - e^{-(t/\alpha)^{\beta}}

\ln[1 - F(t)] = -\left(\frac{t}{\alpha}\right)^{\beta}

\ln\{-\ln[1 - F(t)]\} = \beta \ln t - \beta \ln \alpha

Y = \beta X + b

In the final linear form, Y is the estimated value of ln{−ln[1 − F(t)]} at time t, and X is the natural log of t. To estimate Y, we have to estimate F(t). The estimate to use, as described in Chapter 6, depends on whether exact times of failure are available (in that case, we use median ranks approximated by F̂(t_i) = (i − 0.3)/(n + 0.4) for the ith failure time) or whether the data is readout; in the readout case, use F̂(t_i) = (total number of failures to time t_i)/n. When the calculations are completed, there is an (X, Y) pair for each data point or readout time. A least squares fit, or regression of Y on X, yields estimates β̂ for the slope and b̂ = −β̂ ln α̂ for the intercept. Estimate α by α̂ = e^{−b̂/β̂}. Programs or calculators that will accept the (X, Y) pairs as inputs, and give least squares estimates of β and b as outputs, are common. (A discussion of using the least squares method to fit a line to data points and linear rectification is given in Chapter 6.)

Weibull graph paper, discussed in Chapter 6, has scales adjusted in the same way as the transformations that obtained (X, Y) from the CDF estimate and the time of fail. Therefore, a plot of the CDF estimates versus time on this specially constructed paper will yield an approximate straight line with theoretical slope β and intercept −β ln α, provided the data follow a Weibull model. The computer least squares procedure based on linear rectification is an objective way to put a line on Weibull graph paper that minimizes the squared deviations in the cumulative percent failure scale direction.

In many standard statistical analyses, the estimates obtained via least squares or regression methods can be shown to have very desirable properties similar to those described for MLEs. However, there are several key assumptions about the (X, Y) points that must be made in order for the least squares "optimality" properties to hold. Basically, the random errors in Y at the X points must be uncorrelated and have zero average value and the same variance. All three of these assumptions are known not to hold in the application of least squares for reliability data plotting described above. This method of estimating Weibull parameters gives convenient analytic estimates, which will be good for large amounts of data (consistency property). Little else can be said for them in terms of desirable properties. Also, any confidence bounds on β (or b) given in the output of a regression program are not valid for this application.

CDF values calculated using either MLE or graphical estimation will seldom differ significantly for times within the range where experimental failures occurred. However, we are often concerned with extrapolating back to very early times in the front tail of the life distribution—percentiles much smaller than experimental sample sizes allow us to observe directly.


Here, small changes in estimated parameter values (especially in the shape parameter) can make orders of magnitude difference in the CDF estimate or the AFR. Hence, it is important to use the best technique available for final parameter estimates, which is generally the MLE method.

Chapter 6 will describe techniques for plotting Weibull data and obtaining quick parameter estimates. Even though these estimates are not recommended for use in critical applications, the value obtained from looking at your data on the appropriate graph paper cannot be overemphasized. Sometimes a strange pattern of points on graph paper may cause an analyst to ask questions that lead to a valuable insight that would have been lost had the entire analysis been done by computer programs. Even if the plot only serves to confirm the model chosen, it is useful for presentation and validation purposes.

EXERCISE 4.11
Using the exact failure time data from Example 4.5, plot the estimated CDF [the plotting position for the ith failure time is (i − 0.3)/(n + 0.4)] at each failure time on a linear-by-linear graph. Then plot the Y and X values (calculated as described in this section) for the exact failure times on a linear-by-linear graph. Comment on the results.

EXERCISE 4.12
Using the readout data from Example 4.5, plot the estimated CDF (the plotting position for i total failures is i/n) at each readout time on a linear-by-linear graph. Then plot the Y and X values for the readout failure times on a linear-by-linear graph. Comment on the results.
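A minimal Python sketch (not from the text) of the rectification and least squares steps just described, applied to the 25 exact failure times of Example 4.5, the same data used in Exercises 4.11 and 4.12. Note that these graphical estimates will generally differ somewhat from the MLEs reported earlier.

import numpy as np

fails = np.sort(np.array([0.7, 52.7, 129.4, 187.8, 264.4, 272.8, 304.2, 305.1, 309.8,
                          310.5, 404.8, 434.6, 434.9, 479.2, 525.3, 620.3, 782.8, 1122,
                          1200.8, 1224.1, 1322.7, 1945, 2419.5, 2894.5, 2920.1]))
n = 50
i = np.arange(1, len(fails) + 1)
F_hat = (i - 0.3) / (n + 0.4)             # median rank plotting positions
Y = np.log(-np.log(1.0 - F_hat))          # Y = ln{-ln[1 - F(t)]}
X = np.log(fails)                         # X = ln t

beta_hat, b_hat = np.polyfit(X, Y, 1)     # least squares slope and intercept
alpha_hat = np.exp(-b_hat / beta_hat)     # since b = -beta ln(alpha)
print(beta_hat, alpha_hat)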

4.7  Simulating Weibull Random Variables

As we discussed in Chapter 1, random variables may be simulated through the use of the uniform random variable defined in the interval (0,1) and inverse expressions for the desired CDFs. Specifically, if F is any CDF with an inverse F^{-1}, then by substituting the unit uniform random variable U for p in the inverse expression F^{-1}(p), we generate a random variable distributed according to F. For the Weibull distribution with shape parameter β and characteristic life α, the inverse expression is as follows:

F^{-1}(p) = t = \alpha\left[-\ln(1 - F(t))\right]^{1/\beta}

Thus, substitution of the unit uniform random variable U for p = F(t) gives the following:

F^{-1}(U) = t = \alpha\left[-\ln(1 - U)\right]^{1/\beta}

Because 1 – U and U both represent a uniformly distributed variate in the interval (0,1), for purposes of simulation, we may write the inverse expression as follows:

F^{-1}(U) = t = \alpha\left[-\ln U\right]^{1/\beta}

To generate samples from U (i.e., from a uniformly distributed variate defined on the unit interval), we can use the spreadsheet function RAND(). We can also use the JMP formula editor to generate uniformly distributed variables.


More directly, we can use Minitab (open Calc to Random Data to Weibull) to simulate Weibull data with our choice of characteristic life (scale) and shape parameter.

NOTE: JMP 7 or later has the capability to directly simulate Weibull data.

Class Project 4.1: Simulating Weibull Random Variables

A method to generate the unit uniform random variables is needed. Calculators with such a capability may be used. The spreadsheet function RAND() is an easy approach. However, if a supply of 10-sided dice of various colors is obtainable, rolling the dice can be fun and instructive. Each student takes three different colored 10-sided dice and designates one color as the first digit, the second color as the second digit, and the third color as the last digit. Thus, the roll of the three dice will simulate a uniform random variable in the interval 0 to 999. Adding 1 to the number so obtained and dividing by 1000 will make the interval 0.001 to 1.000.

Each student rolls the three dice 10 times to simulate 10 uniform pseudorandom variables. Add 1 to each number and divide by 1000 to get 10 uniform unit random variables in the interval (0.001, 1.000). Substitute each value for U so obtained into the Weibull inverse expression. One-third of the class will use the shape parameter β = 0.5; one-third, β = 2; the remaining one-third will use β = 10. The characteristic life α will be equal to 100 hours in all cases. As a result, each student generates 10 pseudorandom, unordered, Weibull variables based on specific parameters. Those students with the same β value will combine results and draw a histogram on the blackboard or on an overhead transparency for other class members to view. Each student may wish to calculate the median time to failure and compare with the median expected from the formula:

\alpha(\ln 2)^{1/\beta} = \alpha(0.6931)^{1/\beta}
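A minimal Python sketch (not from the text) of the same inverse-CDF simulation, using a pseudorandom generator in place of RAND() or dice; the seed is arbitrary, and the parameter values are one of the class project cases.

import numpy as np

rng = np.random.default_rng(seed=1)        # arbitrary seed for repeatability
alpha, beta = 100.0, 2.0                   # characteristic life and shape (class project values)
U = rng.uniform(size=10)                   # ten unit uniform random variables
t = alpha * (-np.log(U)) ** (1.0 / beta)   # Weibull pseudorandom failure times

print(np.sort(t))
print(np.median(t), alpha * np.log(2) ** (1 / beta))  # sample median vs. theoretical median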



EXERCISE 4.13 Simulate data from a stress test on 20 units, assuming the failure times follow a Weibull distribution with α = 5000 hours and β = 1.5. Estimate the MTTF, median, and standard deviation. Arrange the observations from smallest to largest. How many failures occur by 2000 hours? By 4000 hours? By 6000 hours?

4.8  The Three-Parameter Weibull Distribution

As described in Section 3.14 for the exponential, we can add a location parameter to the Weibull model by introducing a waiting or threshold time before which failures cannot occur. The three-parameter Weibull density function is

f(t) = \frac{\beta}{\alpha}\left(\frac{t-\mu}{\alpha}\right)^{\beta-1} e^{-((t-\mu)/\alpha)^{\beta}}, \quad \mu \le t < \infty

f(t) = 0, \quad t < \mu

where μ denotes the waiting time (location) parameter.

For the smallest order statistic to be greater than t, it is necessary that all n times in the random sample be greater than t. Since 1 − F(t) is the probability that the random variable T is greater than t, we have

F_{(1)}(t) = 1 - [1 - F(t)]^{n}


The median of the distribution F_{(1)}(t) is found by setting F_{(1)}(t) = 0.5 in the above equation and solving for F(t) to get the median rank, F_{M(1)}(t), that is,

F_{M(1)}(t) = 1 - 0.5^{1/n}

For example, the median rank for the minimum in a sample of n = 5 is 1 − 0.5^{1/5} = 0.12945.

EXERCISE 6.1A
Show for a random sample of five drawn from the exponential distribution with MTTF θ that the CDF of the first-order statistic (the minimum) is also exponential with MTTF = θ/5. Determine the CDF for the fifth-order statistic (the maximum).

EXERCISE 6.2A
Using the results from the previous exercise, plot the CDFs of the smallest and largest order statistics. Differentiate the CDFs to obtain the PDFs and create a graph of the PDFs.

We now consider the general result for F_{(r)}(t) following the approach by David and Nagaraja (2003). For the rth order statistic T_{(r)} to be less than t, at least r of the failure times T_i must be less than or equal to t. For example, in the random sample of five, the probability that the third-order statistic T_{(3)} is less than or equal to t includes the events that three, four, or five of the times T_i are less than or equal to t. Since the probability that T_i ≤ t is given by the CDF F(t), the probability that exactly i of the T_i are less than or equal to t is described by the binomial distribution (see Chapter 11). Consequently, the CDF F_{(r)}(t) for the rth order statistic can be expressed as the sum of the individual binomial probabilities that exactly i of the T_i are ≤ t:

F_{(r)}(t) = \Pr(T_{(r)} \le t) = \sum_{i=r}^{n} \binom{n}{i} [F(t)]^{i} [1 - F(t)]^{n-i}

Note this result applies to any distribution with CDF F(t).

EXERCISE 6.3A
Consider a random sample of five drawn from the exponential distribution with MTTF θ. Determine the CDF for the second-, third-, and fourth-order statistics.

EXERCISE 6.4A
Using the results from the previous exercise, plot the CDFs of the second-, third-, and fourth-order statistics. Differentiate the CDFs to obtain the PDFs and create a graph of the PDFs.

We have specified above the CDF F(r)(t) for any rth order statistic. Consequently, we should be able to find certain distributional properties such as the expected value or variance. Note that such parameters depend upon the specific form of the underlying distribution F(t). For example, if F(t) is uniform on [0,1], David and Nagaraja (2003) show that the expected value for the rth order statistic is r/(n + 1). Some authors such as Plackett (1959) have used tabulated expected values of the rth order statistic, assuming F(t) is normally distributed, for plotting positions. To avoid any distributional assumptions and also because the distribution F(r)(t) may be non-symmetric or skewed, we focus on the median value of the distribution for the rth order statistic. Thus, we seek a general estimate for the population median CDF FM(t) in the equation for the median of the rth order statistic:


F_{(r)}(t) = \sum_{i=r}^{n} \binom{n}{i} [F_{Mr}(t)]^{i} [1 - F_{Mr}(t)]^{n-i} = 0.5

How do we solve for F_{Mr}(t)? It can be shown (see Wackerly, Mendenhall, and Scheaffer 2008) that the cumulative binomial distribution is related to another probability distribution called the incomplete beta function. The relationship is

\sum_{i=r}^{n} \binom{n}{i} [F(t)]^{i} [1 - F(t)]^{n-i} = I_{F(t)}(r, n - r + 1)

where I_{F(t)}(r, n − r + 1) is the cumulative distribution function for the beta distribution. Beta distribution tables are available (see Beyer 1968), but to solve easily for F_{Mr}(t) in the median equation, we can use the inverse beta function in a spreadsheet. The format for the median rank associated with the rth failure time is BETAINV(0.5, r, n − r + 1), for n units on test.

To summarize, the population CDF F(t) evaluated at the median of the distribution of the times for the rth failure corresponds to the median rank plotting position. We show this property with the following example for an exponential distribution.

EXAMPLE 6.1A  ORDER STATISTICS FOR EXPONENTIAL DISTRIBUTION
Five components randomly drawn from an exponential distribution with MTTF = 100 hours are stress tested. What is the median for the distribution of the first-order (minimum) statistic? What population CDF value corresponds to this median time? What is the median rank for the first-order statistic?

Solution
The distribution of the first-order statistic (that is, the minimum) of a sample of five from an exponential distribution with θ = 100 hours is given by



F_{(1)}(t) = 1 - [1 - F(t)]^{5} = 1 - \left[e^{-t/\theta}\right]^{5} = 1 - e^{-t/20}

which is also exponential with MTTF(1) = 20 hours. To find the median for this distribution, we solve for tmed as shown:



F_{(1)}(t_{med}) = 1 - e^{-t_{med}/20} = 0.5

t_{med} = -20 \ln(0.5) = 13.863

The population CDF F(t) at this median value is



F(t_{med}) = 1 - e^{-t_{med}/100} = 0.129449

To find the median rank for the first-order statistic, we use the spreadsheet inverse beta function BETAINV(0.5,1,5−1+1) that returns 0.129449, which is identical to the result found using the distribution of order statistics. In addition to the median, the formula for CDF of the rth order statistic T(r) can be solved for other confidence levels to obtain confidence limits. For example, for 100(1 − α)% bounds, we can set



I_{F(t)}(r, n - r + 1) = \frac{\alpha}{2}


and



I_{F(t)}(r, n - r + 1) = 1 - \frac{\alpha}{2}

and solve using the BETAINV function. If the explicit form of the CDF F_{(r)}(t) allows for direct solution of the quantile t, then confidence limits are easily obtainable for both the CDF and time.

EXAMPLE 6.2A  CONFIDENCE LIMITS ON ORDER STATISTICS FOR EXPONENTIAL DISTRIBUTION
Five components randomly drawn from an exponential distribution with MTTF = 100 hours are stress tested. What is the 95% upper confidence limit for the distribution of the first-order (minimum) statistic? What is the 95% lower confidence limit for the distribution of the first-order (minimum) statistic? What population CDF value corresponds to the median first-order statistic time? What are the corresponding rank estimates for these limits?

Solution
From Example 6.1A, the distribution of the minimum of a sample of five from an exponential distribution with θ = 100 hours is exponential with MTTF_{(1)} = 20 hours. To find the upper 95% limit, we solve for t as shown:

F_{(1)}(t) = 1 - e^{-t/20} = 0.95

t = -20 \ln(0.05) = 59.915

The population CDF F(t) at this value is



F(t) = 1 - e^{-t/100} = 0.45072

To find the 95% upper rank estimate for the first-order statistic, we use the spreadsheet inverse beta function BETAINV(0.95,1,5−1+1), which returns 0.45072, identical to the result found using the distribution of order statistics. To find the lower 95% limit, we solve for t as shown below:



F_{(1)}(t) = 1 - e^{-t/20} = 0.05

t = -20 \ln(0.95) = 1.02587

The population CDF F(t) at this value is



F(t) = 1 - e^{-t/100} = 0.010206

To find the 95% lower rank estimate for the first-order statistic, we use the spreadsheet inverse beta function BETAINV(0.05,1,5−1+1), which returns 0.010206, identical to the result found using the distribution of order statistics.

EXERCISE 6.5A
Five components randomly drawn from an exponential distribution with MTTF = 100 hours are stress tested. What is the median for the distribution of the fifth-order (maximum) statistic? What population CDF value corresponds to this median time? What is the median rank for the fifth-order statistic? Find 95% confidence limits on the CDF and time quantiles and plot the results.


EXERCISE 6.6A
Five components are randomly drawn from a normal distribution with mean μ = 100 and σ² = 9. What is the median for the distribution of the first-order (minimum) statistic? What population CDF value corresponds to this median time? What is the median rank for the first-order statistic? Find 95% confidence limits on the CDF and time quantiles and plot the results.

EXERCISE 6.7A
Five components are randomly drawn from a Weibull distribution with characteristic life α = 100 and shape parameter β = 2. What is the median for the distribution of the first-order (minimum) statistic? What population CDF value corresponds to this median time? What is the median rank for the first-order statistic? Find 95% confidence limits on the CDF and time quantiles and plot the results.
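A minimal Python sketch (not from the text) showing how the BETAINV calculations used in the examples and exercises above can be reproduced with scipy's beta distribution.

from scipy.stats import beta as beta_dist

n = 5
for r in (1, 5):                               # first- and fifth-order statistics
    median_rank = beta_dist.ppf(0.5, r, n - r + 1)
    lcl = beta_dist.ppf(0.05, r, n - r + 1)    # 95% lower rank estimate
    ucl = beta_dist.ppf(0.95, r, n - r + 1)    # 95% upper rank estimate
    print(r, round(lcl, 6), round(median_rank, 6), round(ucl, 6))
# For r = 1 this reproduces 0.010206, 0.129449, and 0.45072 from Examples 6.1A and 6.2A.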

Problems

6.1  Plot the (readout) data up to the "greater than 550 hours" point as given for the memory chip data in Table 3.3 using an exponential model. Do the points appear to follow a straight line? Add confidence limits.

6.2  Plot the 25 exponential data points generated in Example 3.17. Add confidence limits. Do the simulated data appear to follow an exponential model?

6.3  Plot the 25 failure times from Example 4.5, assuming a Weibull model. Repeat using the 10 readout intervals of Example 4.5. In each case, estimate the Weibull parameters from the plot. Add confidence limits.

6.4  Assume a proposed model for life data is

F(t) = 1 - \frac{1}{1 + at}, \quad 0 \le t < \infty, \; a > 0

Describe how you would go through a "rectification" analysis allowing you to construct "probability" graph paper for this distribution. (Hint: The slope of lines on this paper will be 1/a; the intercept is 0.) Plot the five data points given in Example 1.9 on your constructed paper. Do they appear to follow a straight line with slope 1, as expected?

6.5  Plot the 30 failure times generated in Exercise 5.20 using a lognormal model assumption. Do the simulated data appear to follow a lognormal model? Estimate the lognormal parameters from the plot. Add confidence limits.


7  Analysis of Multicensored Data

In Chapter 6, we analyzed complete and singly right-censored data. We discussed estimating the cumulative distribution function (CDF) F(t) at time t by various methods, including median ranks for exact failure times and a binomial estimate for grouped data. These procedures did not invoke any specific distributional form. The CDF estimate was distribution free or nonparametric. The estimate F̂(t) could be tested for goodness of fit using probability plotting, chi-square, or various empirical distribution function (EDF) statistics to see if a specific parametric distribution model—such as the exponential, Weibull, or lognormal—was reasonable for the data.

For exact failure times or grouped readout data with single right censoring, CDF estimation was fairly simple. However, there are many typical situations in reliability involving multiple censoring times, for both exact and disjoint interval data. Also, data may consist of combinations of exact failure times and left-, right-, or overlapping interval-censored observations, called arbitrarily censored data. We may experience what is called current status data, where only one inspection is made on each item for the presence or absence of a property (e.g., cracks in metal). Also, data may be left or right truncated. Nonparametric approaches to handle such analyses are the topics of this chapter. Pointwise and simultaneous confidence intervals are also discussed.

7.1  Multicensored Data

The power of the methods described in this chapter lies in their ability to handle multicensored data (see Chapter 2), which is very common in reliability work and survival analysis. Situations under which multicensored data may arise are often unpredictable. For example, units under stress may fail for causes other than the expected mechanism. Environmental incidents such as electrical disturbances may destroy a portion of components under stress at a random time. Limited resources, such as oven space or receiving samples for stress on varying dates, may cause reliability tests to start at different times. During a stress, unfailed units may be removed periodically from test to track the progress of physical failure processes. Having the techniques to analyze such data is extremely valuable and can save reliability studies from becoming total losses because of unforeseen events.

7.1.1  Kaplan–Meier Product Limit Estimation

The Kaplan–Meier (1958) product limit (PL) estimator is a long name for a very simple and useful procedure for calculating nonparametric survival probability estimates for incomplete data, consisting of exact failure and censoring times or data grouped into nonoverlapping intervals. Nonparametric indicates that no failure distribution is assumed for analysis; that is, the methods apply to any lifetime distribution. In the PL approach, the probability of components surviving any time interval is estimated by the ratio of the observed number of units surviving among the number of units at risk of failure during the interval.


Based on conditional probability concepts, the survival estimates for all successive intervals are multiplied together to provide a cumulative survival or overall reliability estimate at time t. The procedure is best illustrated by the following example.

EXAMPLE 7.1  KAPLAN–MEIER PRODUCT LIMIT ESTIMATES FOR EXACT FAILURE TIME DATA
A very common situation in medical or reliability studies occurs when objects under observation become censored for reasons (death, withdrawal, relocation, etc.) unrelated to the purpose of the study. For example, consider a stress designed to investigate open circuit failures caused by electromigration in a chip. Eight samples are placed on life stress and failures occur at 200, 250, 300, 400, 450, 500, 600, and 700 hours. Failure analysis subsequently determines that the failures at 250 and 450 hours are the result of an accidental electrical overstress that occurred only on these two units. These two failures, which were not caused by or related to electromigration, prevent us from obtaining complete electromigration failure data on all units. These two units provide survival information only up to their running times, which we call "censored times." Alternatively, we state that "losses" occur at 250 and 450 hours (see Figure 7.1).

At 200 hours, there is one failure out of eight units, and the simple probability estimate for a unit surviving 200 hours is 7/8 = .875. (Median rank considerations will be discussed later in this section.) A censored observation occurs at 250 hours, but because there are no electromigration failures between 200 and 300 hours, the survival probability estimate of .875 remains unchanged. When the second failure occurs at 300 hours, only six units are at risk of failure. For those six units, the probability estimate for surviving 300 hours, given that they have already survived 200 hours, is 5/6 = .833. The overall probability of surviving from time 0 to 300 hours, considering only failure times, is then estimated by the product of the sequential survival probabilities

P(T > 300) = (7/8)(5/6) = .729



In words, the probability of surviving 300 hours equals the probability of surviving 200 hours multiplied by the probability of surviving 300 hours given survival to 200 hours. In terms of conditional probability (see Section 1.5 on probability concepts in Chapter 1), P(T > 300) = P(T > 200)P(T > 300|T > 200)



Just prior to the failure at 400 hours, there are five units at risk that have survived from 300 hours. Consequently, the conditional 400-hour survival probability is 4/5. The PL 400-hour survival estimate is

P(T > 400) = (7/8)(5/6)(4/5) = .583

FIGURE 7.1  Eight units on stress: six failures and two censored (units 2 and 4). [Timeline chart of unit number versus time in hours, from 0 to 700, with failure and censored events marked.]


At 450 hours, a second unit is censored. Thus, at the 500-hour failure, given survival to 400 hours, there are only three units at risk, and the 500-hour survival estimate is P(T > 500) = (7/8)(5/6)(4/5)(2/3) = .389



The remaining survival estimates are similarly determined. The CDF F(t) at each failure time is estimated by subtracting the survival probabilities from 1. The estimated PL survival probabilities and CDF at all unique failure times are shown in Table 7.1. Step function plots of the survival curve and CDF are shown in Figure 7.2.

TABLE 7.1
Product Limit Estimated Survival Probabilities

Time (hours)   Product Limit Survival Estimates (Reliability)       CDF Estimate
200            (7/8) = 0.875                                        0.125
300            (7/8)(5/6) = 0.729                                   0.271
400            (7/8)(5/6)(4/5) = 0.583                              0.417
500            (7/8)(5/6)(4/5)(2/3) = 0.389                         0.611
600            (7/8)(5/6)(4/5)(2/3)(1/2) = 0.194                    0.806
700            (7/8)(5/6)(4/5)(2/3)(1/2)(0/1) = 0.000               1.000

Let us now write the general expression for the Kaplan–Meier PL estimator (KMPLE). The notation may at first seem a bit daunting, but the final formula is easily understood. For n units starting at time t_0 = 0, let t_1 < t_2 < ... < t_k denote the observed failure times. Suppose that d_i units fail at t_i. For exact (i.e., unique) failure times, d_i = 1. Suppose also that m_i units are censored in the interval t_i ≤ t < t_{i+1} at times t_{i1}, t_{i2}, ..., t_{im_i}. For example, if there are two units censored between the third failure at t_3 and the fourth failure at t_4 (d_3 = d_4 = 1), then m_3 = 2 with two censoring times t_{31} and t_{32}. The number of units at risk just before the failure at time t_i is denoted as n_i = (m_i + d_i) + (m_{i+1} + d_{i+1}) + ... + (m_k + d_k). Basically, n_i represents the sum of the number of units that can either fail or be censored at times t ≥ t_i. Thus, the Kaplan–Meier PL estimate of the CDF F(t) is

FIGURE 7.2  Nonparametric survival and CDF curves. [Two step-function panels versus time in hours: "KMPLE survival probability versus time" and "CDF estimate."]


\hat{F}(t) = 1 - \prod_{j=1}^{i} \frac{n_j - d_j}{n_j}

for i such that t_i ≤ t. The product of individual terms is indicated by the multiplication symbol \prod_{j=1}^{i} X_j = X_1 X_2 \cdots X_i. This formula represents a step function at each failure time.

For example, using the data in Figure 7.1, we see that there are six failure times t_i, where i = 1, 2, ..., 6, and t_{21}, t_{31} are the two censoring times for the eight units. To estimate the CDF at, say, 325 hours, as t_2 = 300 hours, we use

\hat{F}(325) = 1 - \frac{(8-1)}{8}\cdot\frac{(6-1)}{6} = 0.271
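A minimal Python sketch (not from the text) of the product limit calculation for the Example 7.1 data; it assumes, as in this example, that all event times are distinct.

import numpy as np

times  = np.array([200, 250, 300, 400, 450, 500, 600, 700], dtype=float)
failed = np.array([1,   0,   1,   1,   0,   1,   1,   1])   # 1 = failure, 0 = censored/loss

order = np.argsort(times)
times, failed = times[order], failed[order]

surv = 1.0
for i, (t, f) in enumerate(zip(times, failed)):
    at_risk = len(times) - i               # units still on test just before time t
    if f:                                  # only failures change the survival estimate
        surv *= (at_risk - 1) / at_risk
        print(t, round(surv, 3), round(1 - surv, 3))   # S(t) and F(t), as in Table 7.1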



Thus, from the knowledge of the failure and censoring times, we can construct an estimate of the CDF for multicensored data. Then the estimated CDF can be tested for the applicability of any distributional form using the methods (e.g., probability plots) described in Chapter 6. It is possible to estimate the variance of the CDF estimate Fˆ (t) using a form of what is called Greenwood’s formula (1926), expressed as dj

i

[ Fˆ (t)] = [1 − Fˆ (t)]2 Var ∑



j=1

n j (n j − d j )

for i such that t_i ≤ t. For example, the variance estimate of the CDF estimate at 325 hours is

\widehat{\mathrm{Var}}[\hat{F}(325)] = (1 - 0.271)^{2}\left[\frac{1}{8(8-1)} + \frac{1}{6(6-1)}\right] = 0.0272

An estimate of the standard error of \hat{F}(t_i) is

\widehat{se}[\hat{F}(t_i)] = \sqrt{\widehat{\mathrm{Var}}[\hat{F}(t_i)]}



The variance and standard error estimates of the CDF estimates for the failure times in Figure 7.1 are shown in Table 7.2. Based on the CDF estimate F̂(t) and standard error se[F̂(t)], we would like to establish pointwise confidence intervals on the population CDF. A normal linear approximation two-sided 100(1 − α)% confidence interval would be given by

[F_{\mathrm{lower}}(t), F_{\mathrm{upper}}(t)] = \hat{F}(t) \pm z_{1-\alpha/2}\, \widehat{se}[\hat{F}(t)]

TABLE 7.2
Variance and Standard Error Estimates

Time (hours)   CDF Estimate   Variance Estimate   Standard Error Estimate
200            0.1250         0.01367             0.1169
300            0.2708         0.02722             0.1650
400            0.4167         0.03443             0.1856
500            0.6111         0.04051             0.2013
600            0.8056         0.02903             0.1704
700            1.0000         0.00000             0.0000


However, a CDF is restricted between 0 and 1, and the linear approximation for small sample sizes can result in a negative lower confidence limit or an upper confidence limit greater than 1. For example, at 325 hours, the linear approximation for a 95% confidence interval gives

[F_{\mathrm{lower}}(325), F_{\mathrm{upper}}(325)] = 0.271 \pm 1.96(0.165) = [-0.0525, 0.594]

At 650 hours, the similar linear approximation interval is [0.472, 1.140]. Clearly, these bounds are unreasonable for the population CDF. Some commercial software programs apply the normal linear approximation and simply truncate values below 0 or above 1. Better approximations are possible by transforming the CDF estimate F̂(t), as first proposed by Borgan and Liestol (1990). The use of a transformation can create an unrestricted variable ranging between −∞ and +∞, similar to a normal random variable. For example, we can apply the log transformation ln[−ln(1 − F̂)], which is actually a log transform of the cumulative hazard function H(t) = −ln[1 − F(t)] discussed in Chapter 2. Then we determine the standard error of the transformed function, possibly asymptotically, with a normal approximation. The actual derivation is beyond the scope of this text, but an example of this approach is provided in Kalbfleisch and Prentice (2002). Application of the normal approximation to the log transformation results in the approximate 100(1 − α)% nonsymmetric confidence interval given by

[F_{\mathrm{lower}}(t), F_{\mathrm{upper}}(t)] = \left[1 - (1 - \hat{F})^{d},\; 1 - (1 - \hat{F})^{1/d}\right]

where

d = \exp\left[\frac{z_{1-\alpha/2}\,\widehat{se}(\hat{F})}{(1 - \hat{F})\ln(1 - \hat{F})}\right]

For example, at 325 hours, a 95% approximate confidence interval is found by first determining

d = \exp\left[\frac{1.96 \times 0.1650}{(1 - 0.2708)\ln(1 - 0.2708)}\right] = 0.2455

and then evaluating

\{1 - [1 - \hat{F}(t)]^{d},\; 1 - [1 - \hat{F}(t)]^{1/d}\} = [1 - 0.7292^{0.2455},\; 1 - 0.7292^{1/0.2455}] = [0.0746, 0.7236]

Meeker and Escobar (1998) describe an alternative approximation that uses the more symmetric logit transformation logit( Fˆ ) = ln[ Fˆ/(1 − Fˆ )], leading to a slightly simpler form for the two-sided approximate 100 (1 – α)% confidence interval

[F_{\mathrm{lower}}(t_i), F_{\mathrm{upper}}(t_i)] = \left[\frac{\hat{F}}{\hat{F} + (1 - \hat{F}) \times w},\; \frac{\hat{F}}{\hat{F} + (1 - \hat{F})/w}\right]

where w = \exp\{z_{1-\alpha/2}\,\widehat{se}(\hat{F})/[\hat{F}(1 - \hat{F})]\}. Again, because logit(F) can vary between −∞ and ∞, logit(F) is also unrestricted, similar to a normal random variable. These two methods assure that the confidence interval endpoints will always lie between 0 and 1. For example, for a 95% confidence interval at 325 hours, the logit transformation with

w = \exp\left\{\frac{1.96(0.165)}{0.271(1 - 0.271)}\right\} = 5.142

gives

[F_{\mathrm{lower}}(t_i), F_{\mathrm{upper}}(t_i)] = \left[\frac{0.271}{0.271 + (1 - 0.271) \times 5.142},\; \frac{0.271}{0.271 + (1 - 0.271)/5.142}\right] = [0.0674, 0.656]
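A minimal Python sketch (not from the text) reproducing Greenwood's variance and the logit-transformed 95% limits for the 325-hour estimate worked out above.

import numpy as np

z = 1.96
F_hat = 0.2708
# n_j and d_j for the failures at 200 and 300 hours (the only failure times <= 325):
n_j = np.array([8.0, 6.0])
d_j = np.array([1.0, 1.0])

var_hat = (1 - F_hat) ** 2 * np.sum(d_j / (n_j * (n_j - d_j)))   # Greenwood's formula
se_hat = np.sqrt(var_hat)

w = np.exp(z * se_hat / (F_hat * (1 - F_hat)))                   # logit transformation factor
lcl = F_hat / (F_hat + (1 - F_hat) * w)
ucl = F_hat / (F_hat + (1 - F_hat) / w)
print(round(var_hat, 4), round(se_hat, 4), round(lcl, 4), round(ucl, 3))
# Should agree with 0.0272, 0.1650, 0.0674, and 0.656 from the text.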


TABLE 7.3
Two-Sided 95% Confidence Limits

                            Normal Approximation    Log Transformation    Logit Transformation
Time (hour)  CDF Estimate     LCL        UCL          LCL       UCL          LCL       UCL
200          0.125          −0.104      0.354        0.0186    0.6130       0.0173    0.537
300          0.271          −0.053      0.594        0.0746    0.7236       0.0674    0.656
400          0.417           0.053      0.780        0.1559    0.8198       0.138     0.761
500          0.611           0.217      1.006        0.2758    0.9370       0.230     0.892
600          0.806           0.472      1.140        0.4366    0.9907       0.329     0.972
700          1

1.2

1

CDF LCL (Logit) UCL (Logit) LCL (Normal) UCL (Normal) LCL (Log Trf ) UCL (Log Trf ) CDF = 1

0.8

CDF (t)

Log Transformation

0.6 0.4 0.2 0

0

100

200

300

400

500

600

700

800

−0.2 Time t (hours) FIGURE 7.3 CDF and two-sided 95% confidence limits.

The results for all failure times are shown in Table 7.3 and graphed in Figure 7.3. In general, both the log-transformed and logit-transformed confidence intervals perform better than the normal approximation confidence interval, especially for small sample sizes. Both provide about the correct coverage probability of 95%, even with significant censoring. The log transform intervals are more conservative (e.g., wider) and less symmetric for smaller CDF values than the logit-based intervals. For very large samples, the three methods provide similar results. EXERCISE 7.1 Create plots of x (where 0 < x < 1) versus the logit(x) = ln[x/(1 – x)] and log transformation v(x) = ln(−ln(1 – x)) and compare results. Comment on the behavior for x near zero and x near 1.

199

Analysis of Multicensored Data

EXERCISE 7.2 Verify the results in Table 7.3.

Let us now see how JMP and Minitab handle exact failure data with multiple loss times. We shall use the data from Figure 7.1. SOLUTION TO EXAMPLE 7.1 VIA JMP The data is entered into a data table with two columns, as shown in Table 7.4. The first column contains both failure and censoring times. The second column designates the failure type with 0 for failure and 1 for a censored observation. In the main JMP menu, select Analyze, then Survival and Reliability, and Survival to obtain a dialog box into which we enter the appropriate columns as shown in Figure 7.4. We can choose to display either the survival or failure curve. Clicking OK produces the output shown in Figure 7.5. By clicking the red arrow to the left of Product Limit Survival Fit, we have selected Plot Options, Show Confid Interval. Additionally, we click, under the same red arrow, Save Estimates to obtain the table shown in Figure 7.6. JMP does not provide the actual numerical values of the confidence limits in a table. However, it  is  apparent from the graph and tabulated estimates that the logit transformation previously described in this section is used for the confidence interval estimates. TABLE 7.4 JMP Data Table for Exact Data Example

1 2 3 4 5 6 7 8

FIGURE 7.4 JMP dialog box for exact data example.

Times 200 250 300 400 450 500 600 700

Censor

0 1 0 0 1 0 0 0

200

FIGURE 7.5 JMP output for exact data example.

FIGURE 7.6 JMP save estimates table for exact data example.

Applied Reliability

201

Analysis of Multicensored Data

SOLUTION VIA MINITAB The data is entered into a worksheet with two columns, as shown in Table 7.5. The first column contains both failure and censoring times. The second column designates the failure type with F for failure and C for a censored observation. In the main Minitab menu, select Stat, then Reliability/Survival, followed by Distribution Analysis (Right Censoring), and next Nonparametric Distribution Analysis to get a dialog box into which we enter the appropriate information, as shown in Figure 7.7. Note the option boxes on the right side from which we select Censor… and specify the Censor column. Under Estimate…, we can choose to display failure or survival probabilities, and there are options within Graphs…, Results…, and Storage… that we can choose. A summary output is shown in Figure 7.8, along with the graph in Figure 7.9. Minitab provides the actual numerical values of the confidence limits in a table. The tabulated values show that the confidence interval estimates are based on the normal approximation, with negative limits set to 0 and limits exceeding 1 set to 1. One further point concerning the PL estimator for exact data can be made. Because the product limit estimate (PLE) reduces under no censoring, to the simple i/n plotting positions which we previously stated were not optimal, the estimator can be modified to produce approximate median rank estimators under multiple censoring. We adapt a modification similar to one proposed by TABLE 7.5 Minitab Worksheet for Exact Data Example

1 2 3 4 5 6 7

C1 Times 200 250 300 400 450 600 700

FIGURE 7.7 Minitab dialog box for exact data example.

C2-T Censor F C F F C F F

202

Applied Reliability

Michael and Schucany (1986). The following expression handles the situations with censoring, but it also yields the approximate median rank estimators at the failures times tj

n + 0.7 i n j − d j + 0.7 Fˆ(t ) = 1− ∏ n + 0.4 j =1 n j + 0.7

for i such that the failure times ti ʺ t. For example, using the same data in Figure 7.1, we see that

(8 + 0.7) (8 − 1+ 0.7) (6 − 1+ 0.7) Fˆ(325) = 1− = 1− 0.780 = 0.22 (8 + 0.4) (8 + 0.7) (6 + 0.7)

This modified expression improves the probability plotting positions when multiple censoring is involved. EXERCISE 7.3 A computer manufacturer had three major shipments to customers during the past year. At customer A, 100 new computers were installed on February 1st. At customer B,

FIGURE 7.8 Minitab summary output for exact data example.

Analysis of Multicensored Data

203

FIGURE 7.9 Minitab graph for exact data example.

83 days later, 200 new units were placed into operation. Finally, 200 days after the work at customer B, another 150 units started functioning at customer C. All computers operate 24 hours per day, 7 days per week. By the year’s end, customer A had reported failures on three computers, occurring at 512, 2417, and 7012 hours. Customer B had one failure at 3250 hours and another at 5997 hours. Customer C reported one failure at 105 hours. Failing components were not replaced. Estimate the CDF at each failure time for all computers using the Kaplan–Meier PL estimator.

7.2  Analysis of Interval (Readout) Data Obtaining exact times of failures and losses is not always possible. As we discussed in Chapter 2, often reliability data comes from periodic readouts, generating interval data. Let us consider a different data collection scenario for the results shown in Figure 7.1, consisting of six failures and two losses. Instead of measuring exact times of failures or losses, assume we perform five consecutive weekly readouts, that is, at 168, 336, 504, 672, and 804 hours. The results are summarized in Table 7.6. We are interested in estimating the survival probability at the end of each week. If we estimate the proportion entering each weekly interval that survive throughout the interval, then the end survival probability is estimated by the product of the estimated proportions for all prior intervals. However, the failures and censored observations occur randomly within each interval, and for the readout data, we are not certain when the losses occur relative to the failures. Consequently, the order of losses or failures within an interval is unknown.

204

Applied Reliability

TABLE 7.6 Summary of Readout (Interval) Data Start Time 0 168 336 504 672

End-Time

No of Failures

No of Censored

168 336 504 672 840

0 2 2 1 1

0 1 1 0 0

Consider the interval 168 to 336 hours with two failures and one loss. Eight surviving units entered the interval. Did the loss occur before, after, or between the two failures? The different survival probabilities for the interval would be (5/7), (6/8), or (7/8)(5/6), respectively. In such cases where the specific time sequence is indeterminate, several alternative approaches are possible. Let ni be the number of items entering the ith interval, di the number of failures during the interval, and ri the number of losses during the interval. If we assume all failures happen before any losses, the estimator for the ith interval is  pi =



ni − di ni

If we assume all losses precede failures, the estimator is

pi = 

ni − di − ri ni − ri

Another intermediate value is obtained from the well-known “adjusted–observed” or “actuarial” estimate used in life tables:

pi ( a ) =

ni − di − ri 2 ni − ri 2

This estimate is equivalent to assuming that half the censored observations within an interval precede and half follow the failures in that interval. For the second interval, the adjusted estimate is

p(2a ) =

8 − 2 − 1/2 5.5 = = 0.7333... 8 − 1/2 7.5

Additionally, there is a “joint risk” (Kaplan and Meier 1958) estimate pi for the ith interval given by

n − di − ri pi = i ni

di di + ri

This estimate is the maximum likelihood estimate when the losses are random, the instantaneous rate for losses is a constant multiple of the rate for deaths, and the only data given are the values ni, di, and ri. For the second interval, the joint risk estimate for the proportion surviving is

205

Analysis of Multicensored Data

8−2−1 p2 = 8



2 2+1

= 0.7310

We see that there is little practical difference between the actuarial and joint risk estimates. Using the latter estimate, the general expression for the Kaplan–Meier PL estimator for the CDF F(t) for interval data in which losses occur randomly during an interval is given by Fˆ (t) = 1 −



k

∏ i

ni − di − ri ni

di di + ri

Table 7.7 illustrates the joint risk and PL estimates under the assumption of random losses for the interval data. Note pi = 1 for an interval with no deaths or losses. As ­discussed in Section 7.1, confidence intervals using Greenwood’s formula can be applied to these PLEs. 7.2.1  Interval (Readout) Data Analysis in JMP and Minitab Let us now see how JMP and Minitab treat interval (readout) data with multiple censoring. Data entry for both programs is similar. Each interval has a start time and an end time. Right-censored observations are entered with a start time and a missing value (blank cell in JMP or an asterisk symbol “*” in Minitab) for the end time. Left-censored observations are entered with a missing value for the start time and the end time. For random censoring within an interval, there are several options available in Minitab or JMP. Neither has the joint risk estimation capability, but Minitab has an actuarial estimate and JMP can provide an approximate actuarial estimate by some manipulation of the data. In future versions, it is expected that JMP and Minitab will have the capability to enter random censoring of failures and censored observations occurring within the same interval, such as the data in Table 7.6. Currently in both JMP and Minitab, one is required to treat censoring as occurring either at the beginning or end of an interval. However, random censoring can be approximated in most cases by performing two calculations, one with censoring at the beginning of an interval and the second with ­censoring at the end of the interval, and then averaging the resulting survival probabilities for each interval. We will illustrate these approaches in the examples in Sections 7.2.2 and 7.2.3.

TABLE 7.7 Joint Risk and Product Limit Estimates for Readout (Interval) Data with Losses Occurring Randomly Interval Start Time 0 168 336 504 672

Interval EndTime

Number of Failures During Interval

Number of Censored During Interval

Units at Risk Entering Interval

Joint Risk Proportion Surviving Interval

PLE Survival Probability

PL CDF Estimate

168 336 504 672 840

0 2 2 1 1

0 1 1 0 0

8 8 5 2 1

1.0000 0.7310 0.5429 0.5000 0.0000

1.0000 0.7310 0.3969 0.1984 0.0000

0.0000 0.2690 0.6031 0.8016 1.0000

206

Applied Reliability

7.2.2  Minitab Solution To do the analysis in Minitab, we need to enter the data as shown in Figure 7.10. Note that we have added two censoring times at the midpoint of each interval in which censoring occurs, that is, 252 = (168 + 336)/2 and 420 = (336 + 504)/2. In the main Minitab menu, we select Stat, then Reliability/Survival, followed by Distribution Analysis (Arbitrary Censoring), and next Nonparametric Distribution Analysis to get a dialog box into which we enter the appropriate information, as shown in Figure 7.11. Note the option boxes on the right side from which we select Estimate… and select the Actuarial Estimate…. We can choose to display cumulative failure or survival ­probabilities, and there are options within Graphs…, Results…, and Storage… that we can select. Running the program produces the output as shown in Figure 7.12. 7.2.3  JMP Solution Again we consider the data in Table 7.6. Because JMP does not have an actuarial estimate, we need to do two runs in which the censoring points are modified and then average the results to obtain an approximate actuarial estimate. In JMP, the data tables assuming that

FIGURE 7.10 Minitab spreadsheet for readout example, censoring at beginning of interval.

FIGURE 7.11 Minitab dialog boxes for readout data example.

207

Analysis of Multicensored Data

FIGURE 7.12 Minitab output for readout example, actuarial estimate.

TABLE 7.8 JMP Data Tables for Readout Example, Censoring Occurring at Beginning and at End of Interval Beginning

Start 1 2 3 4 5 6 7

0 168 168 336 336 504 672

End

End

Freq 168 336 • 504 • 672 840

Start 0 2 1 2 1 1 1

1 2 3 4 5 6 7

0 168 336 336 504 504 672

End

Freq 168 336 • 504 • 672 840

0 2 1 2 1 1 1

the censoring occurs at the beginning and end of an interval are shown as Table 7.8. The interval data is entered into a data table with three columns. The first column is the start time for each interval. The second column is the corresponding end time. For censored observations at a readout (end of an interval), the time of censoring is entered into the start column and the end column is left blank. The third column is the frequency for the number of failures or losses in each interval. In the main JMP menu, we select Analyze, then Survival and Reliability, and Survival to obtain a dialog box into which we enter the appropriate columns, as shown in Figure 7.13. We can choose to display either the survival or failure curve. Running the program for each data table produces the output as shown in Figure 7.14. The resulting graphs are line charts without steps. When the red arrow next to ProductLimit Survival Fit is clicked, we can see various analysis options. There is no option to save estimates. We have shown the results for JMP assuming censoring at the beginning of the interval and censoring at the end of the interval. To estimate the CDF for the random censoring for the readout data, we now average the estimated CDFs for the two cases and end up with the result shown in Table 7.9. The results from averaging are very similar to the actuarial and joint risk estimates.

208

Applied Reliability

FIGURE 7.13 JMP dialog box for readout example.

Beginning

End

FIGURE 7.14 JMP output for readout example, assuming censoring at beginning and end of interval.

TABLE 7.9 Table of CDF Estimates for Readout Example, Random Censoring within Intervals CDF Estimates Time 336 504 672

Comparisons

Beginning (JMP)

End (JMP)

Average

Actuarial

Joint Risk

0.286 0.643 0.821

0.250 0.550 0.775

0.268 0.596 0.798

0.267 0.593 0.796

0.269 0.603 0.802

209

Analysis of Multicensored Data

7.3  Life Table Data Because multicensored interval data naturally occur in the analysis of human mortality rates, we will briefly discuss actuarial life tables, which represent the oldest form of estimating the survival function. Life tables are a form of interval data involving specified time periods, in contrast to the PL estimator, which has failure or survival probabilities estimated at the observed failure time points. The actuarial method groups survival times into time periods, resulting in some loss of precision. The PL estimate makes use of the known survival times. A life table shows the probabilities of a member of a particular population surviving (or dying) at a specific age. The specific periods of time are generally expressed in terms of equal intervals (years, weeks, months, etc.). In medical studies, it is a common practice to periodically add new groups (called cohorts) from the population of interest as a study progresses. Thus, for example, in the first year, there may be n1 individuals with selected properties (e.g., having a certain disease) included. In the second year, n2 patients might be added for the second cohort, and so on. Thus, at the end of five years, the study will have cohorts under observation for five years, others for four years, and so on. Consequently, the data is multicensored because at the end of the study the cohorts will have different censoring times. Additionally, some individuals in each cohort may be lost to the study for various reasons including death from other causes, moving away from the area, and so on. These losses must be factored into the analysis. Using the actuarial estimate, we will illustrate the procedure for the construction of a life table for multicensored data using an engineering example. EXAMPLE 7.2  ACTUARIAL LIFE TABLE ESTIMATION A company is concerned about the reliability of a new product received from a supplier. Typically 300 to 500 units are delivered each week for assembly into systems. Although the product was previously qualified, recent failures in prototype systems have been observed. A decision on following through with the planned shipment of this product in two months must be made. The reliability engineer decides that he will take a random sample of 10, 15, or 20 of the units received each week and placed them on a reliability stress. The units will be examined (readout) for failure at the end of each week. Six weeks will be allowed for the study. Occasionally, at random times a unit under stress will be removed from a group for physical failure analysis. At the end of 6 weeks, he will have some units on stress for 6 weeks, some for 5 weeks, and so on. Thus, the data is multicensored. The results of the reliability study are shown in Table 7.10. A notation commonly used in life tables is shown as the heading for the columns. We see from Table 7.10, using the first sample for illustration, that 15 units were placed on stress and followed for 6 weeks. In the first week, one unit failed and one unit was removed for failure analysis. Consequently, entering the second week were 13 units alive on stress. No units failed the second week. One unit failed during the third week, four during the fourth week, none in the fifth, and one in the sixth. Seven units survived the 6 weeks of stress. The other five samples are similarly viewed. We now construct a life table using a spreadsheet based on these results. We aggregate the data to create the columns shown in Table 7.11. 
The rows for the lx, dx, ux, and wx columns are the sums of the corresponding row entries in the matching columns in Table 7.10. The actuarial estimate is the basis for the column labeled “Effective Number of Units on Stress.” In typical life tables, the individuals randomly removed during an interval are assumed to have been exposed to risk for only half the interval. Thus, the effective number of units on stress during the interval x to x + 1 would be estimated using the formula

lx = lx −

ux 2

210

Applied Reliability

TABLE 7.10 Survival Data from Six-Week Reliability Study Alive at Beginning of Week lx

Failed During Week dx

Removed During Week ux

15 13 13 12 8

1 0 1 4 0

1

8

1

Second sample (week 2) 0–1 1–2 2–3 3–4 4–5

10 8 4 4 3

2 2 0 1 0

Third sample (week 3) 0–1 1–2 2–3 3–4

20 14 10 9

6 2 1 1

Fourth sample (week 4) 0–1 1–2 2–3

15 12 9

2 3 2

Fifth sample (week 5) 0–1 1–2

20 19

1 5

14

Sixth sample (week 6) 0–1

15

1

14

Week on Stress (x to x + 1) First sample (week 1) 0–1 1–2 2–3 3–4 4–5 5–6

Survivors After Final Week wx

7

2

3

2 8 1 7

TABLE 7.11 Life Table (Actuarial) Estimation of Failure Probabilities Effective Removed Number During Survivors of Units Week After Week on Stress ux wx lx′

Estimated Estimated Proba­ Proba­ bility of bility of Failing Surviving qx px

Estimated Estimated Cumulative Cumulative Probability Failure of Surviving Probability S(x + 1) F(x + 1)

Alive at Beginning of Week lx

Failed During Week dx

0–1

95

13

2

14

94

0.14

0.86

0.86

0.14

1–2

66

12

4

14

64

0.19

0.81

0.70

0.30

2–3

36

4

0

7

36

0.11

0.89

0.62

0.38

3–4

25

6

0

8

25

0.24

0.76

0.47

0.53

4–5

11

0

0

3

11

0.00

1.00

0.47

0.53

5–6

8

1

0

7

8

0.13

0.88

0.41

0.59

Week on Stress (x to x + 1)

211

Analysis of Multicensored Data

In medical studies, if the patients enter the study randomly during the year, then the effective number would include a similar adjustment on the survivors wx in a given time period, that is, l x = l x − (ux + w x ) 2. In our example, we know that the units in each sample entered the stress together at the start of each week. The estimated probability of failing during the interval x to x + 1, denoted by qx, is found by dividing the number of failures dx by the effective number of units l x . The probability px of surviving is 1 − qx. The cumulative probability of surviving from entry (time zero) through week x + 1 is denoted by S (x + 1) and is estimated by the product

x

S( x + 1) = p0 p1p2  px =

∏p

i

i =0

Note, for x ≥ 1, that S(x + 1) = S(x)px, so the cumulative survival probability column is formed by successive multiplications of the values in the px column, with S(1) = p0. The final column is created using the formula F ( x + 1) = 1 − S( x + 1. )

Actuarial Estimation in Minitab As mentioned earlier in Section 7.2, Minitab provides the capability to do actuarial estimates. The data from Table 7.10 is entered into a Minitab worksheet as shown partially in Figure 7.15. There

FIGURE 7.15 Minitab worksheet (partial) for Table 7.1 data.

212

Applied Reliability

is a column for the period “Start” time and one for the “End” time. For observations censored in a time period, there is an entry in the “Start” column for the midpoint of the interval and a “*” in the same row in the “End” column. The number of failures or censored observations in each time period is entered into the frequency column (“Freq”). We have added a fourth column (“Censor”) to identify rows with failures (F) or censored observations (C). Then, we select Reliability/Survival, Distribution Analysis (Arbitrary Censoring), Nonparametric Distribution Analysis… and enter the “Start,” “End,” and “Freq” columns in the dialog box. Under Estimate…, we choose Actuarial for the Estimation Method. Under Graphs…, we can select Survival Plot or Cumulative Failure Plot. Part of the Minitab output is the Actuarial Table and Cumulative Failure Plot as shown in Figure 7.16. Note in the output Minitab combines an interval in which there is no failure (here 4–5) with the next interval containing a failure (5–6) to produce an interval (4–6).

FIGURE 7.16 Minitab output (partial) actuarial table and failure plot.

Analysis of Multicensored Data

213

EXERCISE 7.4 Suppose a preliminary assessment of the cumulative survival probabilities is desired after only 4 weeks into the study. Using the data in Table 7.10 for the first four samples and only 4 weeks of stress, construct a life table estimation of the failure probabilities. EXERCISE 7.5 Repeat Exercise 7.4, but this time, use the Kaplan–Meier joint risk estimate instead of the actuarial estimate. Compare results of the two estimators.

7.4  Left-Truncated and Right-Censored Data Consider the following situation: We are interested in determining the life distribution after entering the field of a population of components that receive a 24-hour use conditions burn-in before shipment. We receive a sample of 100 post burn-in units to put on a 1000hour stress. However, the number of components that failed during burn-in to produce these 100 units is not provided. The stress test is run, and at the end of 1000 hours, 20 units have failed and 80 still survive. Here we have an example of data that is left-truncated and single right-censored. The CDF estimate at stress time t is for a population known to have survived 24 hours, that is, Fˆ (t) at stress time t = 0 starts at the value zero and is a conditional CDF estimate for the population that has already survived 24 hours of normal operation. This is truncated data because we know nothing about the population prior to 24 hours of use. If we knew how many failures had occurred to get the 100 survivors, it would be left-censored data. Note that data similar to this was analyzed in Chapter 4 using maximum likelihood estimation methods (see Example 4.6) for the Weibull distribution. Here we consider nonparametric estimates for the CDF. There are other situations in which left-truncated and multiple right-censored data can arise. An example is the Channing House data reported by Hyde (1980). This data gives the ages of death of elderly residents of a retirement community. The data is left truncated because only individuals beyond a certain age were allowed to enter the community. An unknown number of those who died prior to this age were excluded from the analysis. For each person joining the community, we have an age of entry and either the age of death or the censoring time. The analysis is based on the number of individuals actually present and at risk of death, when a death at a given age occurs. For small sample sizes, the CDF estimates may be highly variable. In such cases, it is best to start the CDF estimator at an age that yields a sufficient sample size and consider only death times beyond this age. The Kaplan–Meier PLEs have the remarkable property that they estimate the conditional CDF for units surviving to whatever age we start with, despite any earlier or later entries into the population (or multitruncated data). See Klein and Moeschberger (2003) for an analysis of the Channing House data. For reliability engineering work, an analogy might be the analysis of data in which only those units that have seen a minimum age in the field are studied over some period of time. For example, a component manufacturer might be interested in the second-year survival curve for units having at least 1 year in operation. An unknown number of units less than 1 year of age may have already experienced the failure

214

Applied Reliability

event of interest, leading to left-truncated data. Some items may fail during the second year and others may have no failures in the second year, resulting in right-censored observations. The analysis of left-truncated and right-censored data can be easily accomplished in a spreadsheet by setting up one ordered column for the age of entry, death, or censoring, and a second coding column telling what the event recorded was. One then creates a third column of units at risk by adding one for each entry and subtracting one for each death or censoring event. Follow this by using the Kaplan–Meier PL estimation techniques (beginning at a specified age that has a reasonable sample size). Unfortunately, there is no way to enter data into Minitab or JMP and perform such conditional survival analysis. EXAMPLE 7.3  LEFT-TRUNCATED DATA A manufacturer sells a flashlight with a battery that is guaranteed for continuous operation for a maximum of 60 days. To start operating the light, the user removes an insulation strip between the battery and the circuit contact. After 60 days, the light output may decrease to an unsatisfactory level. An output 50% below the original is considered a failure. The manufacturer is interested in the reliability of the light for use beyond 60 days. The manufacturer has 20 units available for analysis, which are returned from the field with various ages of operation, nearly all beyond 60 days. The plan is to run these 20 units for an additional 60 days and determine the failure distribution. The results of the additional 60-days run on these 20 units are tabulated in Table 7.12. Estimate the cumulative failure probability based on the data.

TABLE 7.12 Table of Stress Results for 20 Units Unit Number 1 2 3 4 5 6 7 8 17 9 10 19 11 18 12 13 14 15 16 20

Existing Age (days)

Age of Failure or Censor (days)

58 60 63 66 67 68 70 73 76 83 86 89 90 94 97 100 102 109 110 111

61 62 123 117 127 80 130 115 136 146 105 149 150 154 135 160 162 153 170 171

Code (1 = Failure, 0 = Censor) 1 1 0 1 0 1 0 1 0 0 1 0 0 0 1 0 0 1 0 0

215

Analysis of Multicensored Data

Solution Because we do not know the number of units that failed prior to 58 days and whether all the units in the sample had at least 58 days of operation to enter the study, we have left-truncated data. If we were to begin our estimation of the survival distribution with the first two units, we see that the small sample size would lead to obvious estimation difficulties. The first unit with 58 days failed at 61 days and the second unit with 60 days failed at 62 days. Hence, for such a small sample size, using the PLE, the probability of surviving 61 days would be 1/2, and the probability of surviving 62 days or greater would be zero. Any further estimates using the PLE would result in 0% survival probabilities. This result is clearly misleading because we see that most of the data involves lifetimes greater than 60 days. Consequently, we will condition the results based on left truncation at 60 days, that is, the analysis will be conditioned on survival beyond 60 days. The first step is to determine the number of units at risk versus age. To do so, we set up a spreadsheet in which we list in one column the time-ordered ages of entry, failure, or censoring, coded by E, D, or C, respectively, which we enter into a second adjacent column. In a third column, we determine the number of units at risk by incrementing the total by one for each entry and decreasing the total by one for each failure or censored observation. A partial table is shown in Table 7.13. Note that we start counting the number at risk at 63 days. Figure 7.17 is a graph of the number of units at risk versus age. We then apply the usual Kaplan–Meier PLE by estimating the probability of survival at each age by dividing the number of survivors by the number at risk and sequentially multiplying these estimates to get the cumulative survival probability, from which we obtain the conditional cumulative failure distribution as shown in Figure 7.18. Note that there is no current capability in either Minitab or JMP to perform this analysis of lefttruncated data by conditioning on a specific age. EXERCISE 7.6 Using the data in Example 7.3, estimate the cumulative failure distribution conditioned on survival to 70 days.

TABLE 7.13 Partial Table of Ordered Ages of Entry, Failure, or Censored to Determine Number at Risk Index (i) 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

Age of Entry, Failure, or Censor (days)

Code

58 60 61 62 63 66 67 68 70 73 76 80 83 86 89

E E D D E E E E E E E D E E E

Number at Risk 1 2 1 0 1 2 3 4 5 6 7 6 7 8 9

216

Applied Reliability

Number at risk versus age

18 16 14

Number at risk

12 10 8 6 4 2 0

0

50

100 Age (days)

150

200

FIGURE 7.17 Number at risk.

Conditional failure distribution function

Estimated conditional cumulative failure probability

50%

40%

30%

20%

10%

0%

FIGURE 7.18 Cumulative failure distribution.

0

50

100 Age (days)

150

200

217

Analysis of Multicensored Data

7.5  Left-Censored Data Left-censored observations are fairly common in readout data. For example, if units are placed on stress and the first measurement after the initial time zero reading occurs at 24 hours, then all failures between 0 and 24 hours are left censored. We have shown in Section 7.2 that single left-censored data is easily handled as a form of interval data. However, there are situations in which multiple left-censored observations can occur. For example, the first readout may occur at different times on different samples. Doubly censored data, involving both leftand right-censored observations, is more frequently the situation, and we address this topic in Section 7.6. However, if the multiple left-censored data includes complete failure data on the uncensored observations, then it is possible to apply Kaplan–Meier methods by a simple procedure. A transformation on the time scale involving subtraction of the observed times from a fixed reference point can reverse the order relationships, producing a set of rightcensored observations. Effectively, we are viewing the time distribution from right to left instead of the customary left to right. A famous example of such an approach is described in the study by Ware and DeMets (1976) on baboon descent times from trees. We will describe an example in reliability to illustrate the principles involved. EXAMPLE 7.4  LEFT-CENSORED DATA In a manufacturing process, five units are randomly selected each week from production and placed together on a stress specially designed to fail all units within 24 hours. Because of limited test equipment and operator availability, the first readout for each group of five units occurs somewhere between 6 and 12 hours after the start of stress. Units are then monitored continuously to failure. Because units can fail prior to the different first readout times, left-censored observations are experienced. Of interest is the distribution of failure times based on results obtained over a period of 5 weeks. The outcome of the study is shown in Table 7.14. We begin the analysis by changing the sign of each time value, thereby converting the data to ordinary right-censored data. Then the transformed times are sorted and the Kaplan–Meier estimator is applied to determine the survival estimates. The results are shown in Table 7.15. Next, the failure probabilities F(t) are the survival estimates in reverse order and referenced to the original times as shown in Table 7.16 and plotted in Figure 7.19. EXERCISE 7.7 Recreate the estimates shown in Tables 7.15 and 7.16 from the data in Table 7.14. JMP and Minitab Analysis of Left-Censored Data To perform the analysis of the data in Table 7.14, both JMP and Minitab use a Peto– Turnbull estimator, which is briefly described in Section 7.6. The data table in JMP for the data in Table 7.14 and the output are shown in Figure 7.20. TABLE 7.14 Table of Observed Times to Failure Units on Stress 1 2 3 4 5

Week 1

Week 2

Week 3

Week 4

Week 5

13.1 6.8* 6.8* 9.1 10.5

10.2* 10.2* 10.2* 11.6 10.2*

13.0 8.7* 8.7* 12.0 18.1

9.0* 21.8 12.9 10.8 11.8

6.4* 7.0 9.9 13.3 11.9

Left-censored observations.

*

218

Applied Reliability

TABLE 7.15 Analysis of Left-Censored Data Transformed Times Sorted

Type C = Censor F = Failure

Frequency

Kaplan–Meier Survival Estimate

6.4 6.8 7.0 8.7 9.0 9.1 9.9

−21.8 −18.1 −13.3 −13.1 −13.0 −12.9 −12.0

F F F F F F F

1 1 1 1 1 1 1

1.000 0.960 0.920 0.880 0.840 0.800 0.760

10.2 10.5 10.8 11.6 11.8 11.9 12.0 12.9 13.0 13.1 13.3 18.1 21.8

−11.9 −11.8 −11.6 −10.8 −10.5 −10.2 −9.9 −9.1 −9.0 −8.7 −7.0 −6.8 −6.4

F F F F F C F F C F F C C

1 1 1 1 1 4 1 1 1 2 1 2 1

0.720 0.680 0.640 0.600 0.560 0.520 0.520 0.462 0.404 0.404 0.404 0.303 0.303

Original Times t

TABLE 7.16 Analysis of Left-Censored Data Original Times t 6.4 7.0 9.1 9.9 10.5 10.8 11.6 11.8 11.9 12.0 12.9 13.0 13.1 13.3 18.1 21.8

Failure Probability F(t) 0.303 0.404 0.462 0.520 0.560 0.600 0.640 0.680 0.720 0.760 0.800 0.840 0.880 0.920 0.960 1.000

219

Cumulative failure probability

Analysis of Multicensored Data

1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0

0

5

10 15 Time (hours)

FIGURE 7.19 Plot of CDF estimate versus time for left-censored data.

FIGURE 7.20 JMP data table and output for left-censored data in Table 7.14.

20

25

220

Applied Reliability

FIGURE 7.21 Minitab output and graph for left-censored data in Table 7.14.

The analysis can also be done in Minitab by creating a similar table in the worksheet in which each blank cell is replaced with an asterisk “*.” The output and graph in Minitab are shown in Figure 7.21.

7.6 Other Sampling Schemes (Arbitrary Censoring: Double and Overlapping Interval Censoring)—Peto–Turnbull Estimator So far, we have discussed the analysis of multicensored data, both exact and readout (nonoverlapping intervals) data including random censoring within intervals and left-censored and truncated data. However, more general forms of interval censoring can occur involving overlapping intervals or combinations of left and right (double) censoring that are not addressable by the techniques we have covered. Methods for the analysis of arbitrarily censored data were developed by Peto (1973) and Turnbull (1974 and 1976). Such estimators are often called Peto–Turnbull estimators. The theory behind such estimators is outside the scope of this text, but descriptions and examples can be found in Lawless (2003), Kalbfleisch and Prentice (2002), and Klein and Moeschberger (2003). Both JMP and Minitab are capable of analyzing such data. In fact, there is one form of arbitrarily censored data reported previously by Nelson (1990) that deserves special mention because of the reliability implications. That is the next topic of discussion. 7.6.1  Current Status Data Suppose we are interested in determining the distribution of time before initiation of some mechanism that leads to failure. For example, Nelson (1990) describes a study in which a large number of turbine wheels were inspected once to learn if a crack had begun forming. The suspicion was that the failure rate of the wheels would increase with age, and consequently, replacement of wheels would be appropriate at some age when the risk of cracking became too high. Each wheel of known age was classified as either cracked or not cracked. Then, the percent of cracked wheels among those inspected in different age intervals could

221

Analysis of Multicensored Data

provide an estimate of the time distribution to initiation of cracks. What makes this situation different from those previously considered is that the data consists only of left-censored (crack began before an inspection age) or right-censored (crack had not begun at the inspection age) data. We have no exact times to crack initiation. This type of interval censored data, involving a single inspection for the presence or absence of a characteristic, occurs commonly in medical studies and is called current status data. Note that no failures have actually occurred among the units sampled. However, a property associated with eventual failure can be detected and thereby action taken to avoid or mitigate the failure. The analysis of such data can be performed in JMP and Minitab. EXAMPLE 7.5  CURRENT STATUS DATA A manufacturer of disk drives is receiving reports of premature disk drive failures. It is suspected that a process change a year earlier may have caused the introduction of particulate contamination, which can build up on the sliding arms of computer disk drives and cause disk wear and eventual crashes. To investigate further, the manufacturer has collected 360 samples of drives in normal operation for 1–12 months, obtaining 30 samples for each month. The drives are opened (destructive inspection) and checked for the presence of any disk wear using sensitive equipment. Each drive so inspected will be classified as having evidence of wear or not. The data collected for different disk ages can then be used to estimate the time distribution at which wear begins. This is an example of current status data because each drive inspected will be classified as showing evidence of wear (left censored) or not showing (right censored). The data is shown in Table 7.17. Note that the intervals overlap. For example, the interval from 0 to 12 months encompasses all previous intervals. For drives exhibiting evidence of wear, we do not know when the evidence started, but only that it happened sometime in the 12-month period (left censored). For each interval, the CDF estimate is simply the binomial estimate based on the number of units showing evidence of wear divided by the number of drives measured for that interval (here 30). Binomial pointwise confidence limits (using the exact formula) are also shown in the table. The results are plotted in Figure 7.22. Because of the estimation method, we see that the CDF estimate is not monotonically increasing, which is a requirement. Thus, we have to use a method that preserves the properties of a CDF. The Peto–Turnbull estimate used in the Minitab and JMP approach, shown next, has this property. TABLE 7.17 Disk Drive Data Months in Field 1 2 3 4 5 6 7 8 9 10 11 12

Evidence of Wear

No Evidence

Cumulative Percentage with Evidence of Wear

2 2 6 4 7 13 14 12 9 13 15 17

28 28 24 26 23 17 16 18 21 17 15 13

6.7 6.7 20.3 13.3 23.3 43.3 46.7 40.0 30.0 43.3 50.0 56.7

Binomial 95% Confidence Limits 0.8 0.8 7.7 3.8 9.9 25.5 28.3 22.7 14.7 25.5 31.3 37.4

22.1 22.1 38.6 30.7 42.3 62.6 65.7 59.4 49.4 62.6 68.7 74.5

222

Applied Reliability

Disk drive data

80 70

CDF (%)

60 50 40

CDF estimate

30

95% LCL

20

95% UCL

10 0

0

2

4

6 8 Months in field

10

12

FIGURE 7.22 Disk drive data CDF plot.

FIGURE 7.23 Current status data table and analysis output in JMP.

Current Status Data Analysis in JMP and Minitab The JMP data entry table is shown in the left half of Figure 7.23. Note that the left-censored observations are entered with a blank cell in the Start column and the interval end in the End column. The right-censored observations are entered with the month end in the Start column and a blank

223

Analysis of Multicensored Data

(a)

(a) FIGURE 7.24 Current status data table and analysis output in Minitab.

(b)

(b)

cell in the End column. Running the Survival/Reliability platform produces the output as shown in the right half of Figure 7.23. In Minitab, the data is entered similarly into a worksheet. Running the Distribution Analysis (Arbitrary Censoring), Nonparametric Distribution Analysis… produces the output and graph as shown in Figure 7.24. The Turnbull estimate provides a monotonically increasing CDF. EXERCISE 7.8 Create a single plot that shows both the CDF estimate in Table 7.17 and the Turnbull estimate in either Figure 7.23 or Figure 7.24.

7.7 Simultaneous Confidence Bands for the Failure Distribution (or Survival) Function We often see the estimated cumulative failure distribution (CDF) curve graphed with confidence intervals. It is not unusual for the confidence intervals to be connected together to form a region containing, for a specified confidence level, the entire CDF curve. However, these confidence intervals are typically pointwise, that is, they are applicable for the failure distribution only at a single point. These intervals are not all simultaneously valid for every estimated point of the CDF curve. We seek “confidence bands” that hold for all

224

Applied Reliability

estimates of the CDF over a defined region. At a given confidence level, these simultaneous confidence bands capture the failure distribution within the band and allow us to check on the suitability of models. We want to find two random functions LCB(t) and UCB(t) such that

1 − α = Pr[LCB(t) ≤ F(t) ≤ UCB(t)]

for all tL ʺ t ʺ tU . Here, [LCB(t), UCB(t)] is called a 100 (1 – α)% simultaneous confidence band for F(t). When we have censored observations, the confidence band for F(t) will be wider than the Kolmogorov theory band (see Chapter 6) for uncensored data, at all values of t greater than the earliest censored time. Also, the band will grow wider with time as more censored observations are encountered. We will discuss four approaches for constructing such bands: Hall and Wellner (1980) bands; Nair (1984) equal probability (EP) bands; Hollander, McKeague, and Yang (1997) likelihood ratio-based bands; and Efron (1979) bootstrap bands. 7.7.1  Hall–Wellner Confidence Bands We introduced in Section 7.1 the Greenwood formula for the variance of the Kaplan–Meier estimate as

[ Fˆ (t)] = [1 − Fˆ (t)]2 Var

i

dj

∑ n (n − d ) j

j=1

j

j

for i such that ti ʺ t, where ni is the number of units at risk just before di failures at time ti. The standard error is

 [ Fˆ (t)] = Var [ Fˆ (t)] se

Concerning notation, in the original Hall–Wellner paper, the development of the bands is made in terms of the survival function S(t), which is equivalent to the reliability function R(t) =  [ Fˆ (t)] = se  [1 − Fˆ (t)]. We shall express the Hall–Wellner (H-W) bands in terms 1 – F(t). Note that se of the CDF. For starting sample size N, Hall and Wellner define the terms CN (t) and K N (t) as:

 [ Fˆ (t)] se CN (t) = N 1 − Fˆ (t)



K N (t) =

2

i

=N

dj

∑ n (n − d ) j=1

j

j

j

CN (t) 1 + CN (t)

−1

We see that 1 − K N (t) = [ 1 + CN (t)] . Hall and Wellner then show that asymptotically as N → ∞ the bands formed by

d 1 − Fˆ (t) FˆN (t) ± N ,1−α N 1 − K N (t)

contain the population F(t) with probability 1 – α. Here, dN ,1−α is a critical value based on Kolmogorov statistics obtainable from Table 7.18. The theory is valid for the maximum

225

Analysis of Multicensored Data

time tmax set equal to the next-to-longest failure time instead of the longest observed failure time. Furthermore, as we discussed in Section 7.1, log–log or logistic transformations can be applied to improve this linear formula and assure that the confidence bands do not fall outside the limits [0,1]. The H-W bands become the usual Kolmogorov bands when all observations are uncensored. In the absence of censoring, [1 − Fˆ (ti )]/[1 − K N (ti )] equals unity. As censoring occurs, this ratio increases. This ratio gives us an idea of the effect of censoring that the H-W bands represent over the Kolmogorov-type bands. In effect, the H-W bands provide progressive multipliers to the Kolmogorov bands to account for censoring. According to Nair (1984), the H-W confidence bands are valid for N as low as 25 and up to 50% censoring. Also, Hall and Wellner state that their method produces conservative bands. They provide a table of smaller critical values for dN ,1−α when the term K (tmax ) < 0.75, which occurs if the estimated CDF at tmax < 0.60 (see Table 7.19). Although the notation can be a bit confusing, the analysis and construction of the bands can be done fairly easily in a spreadsheet program as shown in Example 7.6. Further discussion of H-W bands can be found in Harris and Albert (1991). TABLE 7.18 Percentiles of Distribution of Kolmogorov

dN ,1−α Statistics N

1−α N

0.80

0.85

0.90

0.95

0.99

5 10 20

0.45 0.32 0.23

0.47 0.34 0.25

0.51 0.37 0.26

0.56 0.41 0.29

0.67 0.49 0.35

0.21 0.19 0.18 0.17 0.16 0.15 1.07/ N

0.22 0.20 0.19 0.18 0.17 0.16 1.14/ N

0.24 0.22 0.20 0.19 0.18 0.17 1.22/ N

0.26 0.24 0.23 0.21 0.20 0.19 1.36/ N

0.32 0.29 0.27 0.25 0.24 0.23 1.63/ N

25 30 35 40 45 50 For larger values

Source: Massey, F. J., Jr. 1951. Journal of the American Statistical Association 46:253. Adapted from Table 1, p. 70. Reprinted with permission from The Journal of the American Statistical Association. Copyright 1951 by the American Statistical Association. All rights reserved.

TABLE 7.19 Critical Values of dN ,1−α for H-W Confidence Bands When K N (tmax ) < 0.75 K N (t max ) 1–α

0.10

0.25

0.40

0.50

0.60

0.75

0.90

1.00

0.99 0.95 0.90 0.75 0.50

0.851 0.682 0.599 0.471 0.356

1.256 1.014 0.894 0.711 0.544

1.470 1.198 1.062 0.854 0.663

1.552 1.273 1.133 0.920 0.720

1.600 1.321 1.181 0.967 0.765

1.626 1.354 1.217 1.008 0.809

1.628 1.358 1.224 1.019 0.827

1.628 1.358 1.224 1.019 0.828

Source: Hall, W. J., and J. A. Wellner. 1980. Biometrika 67:141.

226

Applied Reliability

EXAMPLE 7.6  ESTIMATING F(t) WITH H-W CONFIDENCE BOUNDS A 1000-hour stress is run on 50 units to estimate the failure distribution F(t) for a particular failure mechanism. By the end of test, 32 failures are observed. However, when failure analysis is completed on the units, it is discovered that six failures were the result of a different, independent, competing mechanism. Consequently, the times of failure for the second mechanism will be treated as censoring times for the first (primary) mechanism under study. The times to failure and censoring times are shown in Table 7.20. The spreadsheet calculations needed for determining the H-W confidence bands are shown in Figure 7.25 for the first 18 observations. Completion of the spreadsheet for the remaining times to failure is left as an exercise. Based on the last column in Figure 7.25, we can now estimate the H-W confidence bands. From Table d 7.18, for a 90% confidence level, the Kolmogorov statistic N,1−α = 0.17. However, for the next-toN longest failure time tmax = 843, it can be shown (see Exercise 7.6) that KN (tmax ) = 0.5467, which is less than 0.75. Consequently, using Table 7.19, we interpolate to find that dN,1−α = 1.1554/ 50 = 0.1634, N which is very close to the value of 0.17 in Table 7.18. The H-W 90% confidence bands are shown in Table 7.21, along with the confidence bands obtained by both log and logit transformation. A plot of the CDF estimate and the different H-W confidence bands are shown in Figure 7.26. Note that the untransformed H-W LCB limits have been set equal to zero for values less than zero. Also, we see that the untransformed bands (H-W LCB and H-W UCB) are more conservative than the transformed bands and that both log and logistic transformations produce similar bands. EXERCISE 7.9 Create a spreadsheet as shown in Figure 7.25 and complete the analysis for the remaining rows in Table 7.20. Determine K N (tmax ).

TABLE 7.20 Failure and Censor Times for Primary Mechanism (N = 50) Time i 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17

Frequency

Time

Frequency

ti

Type

di

i

ti

Type

di

33 62 86 138 141 162 168 201 222 228+ 281 306+ 355 367 374 452

F F F F F F F F F C F C F F F F C

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33

486 547 562 630 678 713+ 734 736 738 742 772 843 871 877+ 975+ 1000+

F F F F F C F F F F F F F C C C

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 18

461+

i 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18

ti

Type

At Risk ni

33 62 86 138 141 162 168 201 222 228+ 281 306+ 355 367 374 452 461+ 486

F F F F F F F F F C F C F F F F C F

50 49 48 47 46 45 44 43 42 41 40 39 38 37 36 35 34 33

Freq. di 1 1 1 1 1 1 1 1 1 0 1 0 1 1 1 1 0 1

Cumulative

(ni − di)/ni

KM Surv. 1 − F(ti)

CDF F(ti)

(di/(ni(ni − di)))

(di/(ni(ni − di)))

StdErr(F(ti))

CN

KN

0.9800 0.9796 0.9792 0.9787 0.9783 0.9778 0.9773 0.9767 0.9762 1.0000 0.9750 1.0000 0.9737 0.9730 0.9722 0.9714 1.0000 0.9697

0.9800 0.9600 0.9400 0.9200 0.9000 0.8800 0.8600 0.8400 0.8200 0.8200 0.7995 0.7995 0.7785 0.7574 0.7364 0.7153 0.7153 0.6937

0.0200 0.0400 0.0600 0.0800 0.1000 0.1200 0.1400 0.1600 0.1800 0.1800 0.2005 0.2005 0.2215 0.2426 0.2636 0.2847 0.2847 0.3063

0.00041 0.00043 0.00044 0.00046 0.00048 0.00051 0.00053 0.00055 0.00058 0.00000 0.00064 0.00000 0.00071 0.00075 0.00079 0.00084 0.00000 0.00095

0.00041 0.00083 0.00128 0.00174 0.00222 0.00273 0.00326 0.00381 0.00439 0.00439 0.00503 0.00503 0.00574 0.00649 0.00729 0.00813 0.00813 0.00907

0.0198 0.0277 0.0336 0.0384 0.0424 0.0460 0.0491 0.0518 0.0543 0.0543 0.0567 0.0567 0.0590 0.0610 0.0629 0.0645 0.0645 0.0661

0.0204 0.0417 0.0638 0.0870 0.1111 0.1364 0.1628 0.1905 0.2195 0.2195 0.2516 0.2516 0.2871 0.3247 0.3643 0.4064 0.4064 0.4537

0.0200 0.0400 0.0600 0.0800 0.1000 0.1200 0.1400 0.1600 0.1800 0.1800 0.2010 0.2010 0.2231 0.2451 0.2670 0.2889 0.2889 0.3121

[1 − F(ti)]/ 1 − KN (1 − KN) 0.9800 0.9600 0.9400 0.9200 0.9000 0.8800 0.8600 0.8400 0.8200 0.8200 0.7990 0.7990 0.7769 0.7549 0.7330 0.7111 0.7111 0.6879

1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0006 1.0006 1.0020 1.0033 1.0047 1.0060 1.0060 1.0084

Analysis of Multicensored Data

Time

FIGURE 7.25 Spreadsheet showing the calculations for determining the Hall–Wellner confidence bands for the first 18 observations in Table 7.20.

227

228

Applied Reliability

TABLE 7.21 Hall–Wellner 90% Confidence Bands—Untransformed and with Log and Logit Transformations H-W LCB

H-W UCB

Log trf d

H-W LT LCB

H-W LT UCB

Logit trf w

H-W LG LCB

LG UCB

0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.017 0.017 0.037 0.037 0.058 0.079 0.099 0.120 0.120 0.142

0.183 0.203 0.223 0.243 0.263 0.283 0.303 0.323 0.343 0.343 0.364 0.364 0.385 0.407 0.428 0.449 0.449 0.471

0.141 0.250 0.322 0.375 0.416 0.449 0.476 0.500 0.520 0.520 0.537 0.537 0.553 0.566 0.579 0.590 0.590 0.600

0.003 0.010 0.020 0.031 0.043 0.056 0.069 0.083 0.098 0.098 0.113 0.113 0.129 0.146 0.162 0.179 0.179 0.197

0.134 0.151 0.175 0.199 0.224 0.248 0.271 0.295 0.317 0.317 0.341 0.341 0.364 0.388 0.411 0.433 0.433 0.456

7.242 4.114 3.213 2.778 2.519 2.347 2.223 2.130 2.057 2.057 2.000 2.000 1.955 1.918 1.886 1.860 1.860 1.839

0.003 0.010 0.019 0.030 0.042 0.055 0.068 0.082 0.096 0.096 0.111 0.111 0.127 0.143 0.160 0.176 0.176 0.194

0.129 0.146 0.170 0.195 0.219 0.242 0.266 0.289 0.311 0.311 0.334 0.334 0.357 0.380 0.403 0.425 0.425 0.448

0.8 0.7

Failure probability

0.6 0.5 0.4 0.3

F(ti) H-W LCB H-W UCB

0.2

H-W LT LCB H-W LT UCB

0.1 0

H-W LG LCB H-W LG UCB

0

200

400

600 Time (hours)

800

1000

1200

FIGURE 7.26 Kaplan–Meier F(t) estimate and Hall–Wellner 90% confidence bands: linear (H-W), log (H-W LT), and logit (H-W LG) transformations.

Analysis of Multicensored Data

229

7.7.2  Nair Equal Precision Confidence Bands In a study published in 1984, Nair describes equal precision simultaneous confidence bands, which are obtained for large samples by adapting Greenwood's variance formula and the normal approximation. Nair compares the performance of the EP bands to the Hall–Wellner bands in his study and finds the approaches competitive. The EP band is valid only in a restricted region, and the user has to decide on the range. The linear normal approximation used in the Nair bands does not do as well as a logistic transformation and actually performs poorly when the sample size is small (see Weston and Meeker 1991). The application of the Nair approach is described in Meeker and Escobar (1998). 7.7.3  Likelihood Ratio-Based Confidence Bands In 1997, Hollander, McKeague, and Yang published a paper describing the creation of simultaneous confidence bands based on a likelihood ratio (LR) approach for confidence intervals first introduced by Thomas and Grunkemeier (1975). Hollander et al. recommend the LR confidence bands because the coverage accuracy is satisfactory, the boundaries are naturally contained in [0,1], and the bands are preserved under transformations. In a simulation study, the LR bands perform quite well compared with the H-W bands and outperform the EP bands to a great extent. Because the H-W bands are currently the most popular and the development of the LR method is relatively recent, the LR bands will not be covered here. Computer programs are required for analysis. Because the H-W bands are easily calculated using spreadsheets and reduce to the well-known Kolmogorov bands under no censoring, we have focused our attention on the H-W bands. 7.7.4  Bootstrap Methods for Confidence Bands Efron developed the methodology of bootstrapping in 1979. Because of very fast computers with excellent random number generation capabilities, bootstrapping is very applicable for solving problems in statistical analysis. The idea is simple. From a data set of N observations, we randomly sample N values and use the randomly sampled observations to estimate a desired parameter by whatever statistical procedure is appropriate. Then, we repeatedly resample N observations many times (e.g., 250 or more repetitions), each time estimating the desired parameter. Note that each observation in the original data set may be included in the randomly sampled group one or more times or not at all. The distribution of the parameter estimates so obtained by resampling is the basis for the empirical calculation of standard errors, confidence limits, or specific percentiles. Bootstrapping parameter estimates have been shown to asymptotically converge to the target value. Also, bootstrap nonparametric confidence bands seem to be more accurate than asymptotic approaches, as described by Akritas (1986). The extension of bootstrapping to the estimation of confidence bands for the Kaplan– Meier estimator for censored survival data was developed by Akritas (1986). We shall follow the summary of the methodology for calculating bootstrap confidence bands as described by Harris and Albert (1991). For censored data, Efron’s (1981) approach is to resample the pairs (tj,δj) from the observed data where tj is the time of failure or censoring and δj is the associated indicator variable identifying the observation as either failure (δj = 1) or censored value (δj = 0). Consider N pairs randomly drawn with replacement from the original data set of N observations. 
In Efron’s procedure, let m∗j denote the number of times the original data pair (tj,δj) appears in the bootstrap sample. For example, the first pair (t1 , δ 1 ) might appear twice

230

Applied Reliability

in a particular bootstrap sample, and so m1∗ = 2 . The second pair (t2 , δ 2 ) may not appear at all, and thus, m2∗ = 0. Efron defines

N

M ∗j ≡ ∑ mi∗     j = 1, 2 , ... , N i= j

So M1∗ = N as we sum over all randomly drawn N observations. Then, M2∗ = N − m1∗ because the summation starts at j = 2, and so on. Efron states that the Kaplan–Meier CDF curve based on the bootstrap data is

N ⎛ m*j ⎞ Fˆ * (t) = 1 − ∏ ⎜ 1 − * ⎟ Mj ⎠ i= j ⎝

δj

Suppose we now repeat resampling and estimation of the failure curve say 200 to 300 times, and for each bootstrap sample, we compute the statistic

1 − K N (t) max FˆN* (t) − FˆN (t) . t ≤tmax 1 − FN (t)

where K N (t) was previously defined in connection with the Hall–Wellner bands. The ­collection of 200 to 300 of these statistics can be ordered by magnitude, allowing us to determine the 100 (1 – α) percentile of the distribution of these statistics, which we designate by c N . According to Akritas (1986),

c [1 − FˆN (t)] FˆN (t) ± N t ≤ tmax 1 − K N (t)

form a 100 (1 – α) percent confidence band for F(t), where tmax is the next-to-last longest failure time observed. Obviously, the bootstrap approach involves fairly extensive computation, and consequently, computer programs are required to perform the simulation and calculation of the statistics. However, the bootstrap confidence bands can be narrower than the Hall–Wellner bands and valid with small sample sizes. In the upper end of the failure curve with heavy censoring, both bands may be so wide as to be practically of little value. An example of how bootstrap confidence bands can be obtained using Excel spreadsheet functions is shown in Appendix 7A. Further discussion of bootstrap confidence bands is presented in Harris and Albert (1991). 7.7.5  Confidence Bands in Minitab and JMP Minitab 15 and earlier versions do not provide the capability to generate simultaneous confidence bands. JMP 7 and later versions do provide an option in the Survival/Reliability platform to plot Nair EP 95% simultaneous confidence bands as part of the nonparametric Kaplan–Meier failure function plot. See Figure 7.27. However, JMP does not provide a way to create an output data table that can show the actual results of the calculations. EXERCISE 7.10 For the data in Table 7.20, create 90% bootstrap confidence bands.

231

Analysis of Multicensored Data

FIGURE 7.27 JMP output showing Nair 95% EP confidence bands.

7.8  Cumulative Hazard Estimation for Exact Failure Times An alternative procedure for handling multicensored data using the cumulative hazard function is described by Nelson (1982). This technique involves estimating the cumulative hazard function at each failure time and then plotting the points on cumulative hazard paper. An extensive treatment of this topic is given by Nelson (1972). Properties of Nelson’s method were developed by Aalen (1978), and consequently, the procedure has been called the Nelson–Aalen estimator. Recall from Chapter 2 that the cumulative hazard function H(t) of any distribution is related to the CDF, F(t), by

H (t) = − ln[1 − F(t)]

232

Applied Reliability

TABLE 7.22 Cumulative Hazard Calculation Time of Failure t1 t2 t3

Number at Risk

Number of Failures

Hazard Value

n1 n2 n3

d1 d2 d3

h1 = d1/n1 h2 = d2/n2 h3 = d3/n3

Cumulative Hazard Estimate H1 = 100h1 H2 = H1 + 100h2 H3 = H2 + 100h3











tu

nu

du

hu = du/nu

HU = HU−1 + 100hU

and the derivative of the cumulative hazard function is the hazard rate h(t), that is,

h(t) =

dH (t) f (t) = dt 1 − F(t)

First we need a method for estimating H(t). At each time t, individual hazard terms are estimated by dividing the number of failures by the number of units surviving just before time t. H(t) is then estimated by the cumulative sum of these terms. Expressed as a formula, the cumulative hazard is estimated for ti ʺ t as

i

dj

∑n

Hˆ (t) =

j=1

j

where ni is the number of units at risk just before di failures at time ti. For exact, unique times to failure, the number of failures di at each failure time ti is always one. Thus, a simple procedure is to order the failure times from lowest to highest, and then associate with each failure time the reverse rank starting with the initial sample size. For example, with 20 units, we have the first failure match with 20, the second with 19, and so forth. Then for each time, the cumulative hazard function estimate is just the sum of the reciprocals of the reverse ranks to that time because there is one failure for each of the reverse ranks. Each reciprocal calculated this way is called a “hazard value.” The method of estimation is detailed in Table 7.22. The estimated variance of the Nelson–Aalen estimator, due to Aalen (1978) is

i

dj

∑n

[ Hˆ (t)] = 10, 000 Var

j=1

2 i

for ti ≤ t

The factor 10,000 is needed because the cumulative hazard was calculated in percent units. EXAMPLE 7.7  CUMULATIVE HAZARD PLOTTING Using the exact failure time data in Figure 7.1, calculate the cumulative hazard estimates at each failure time.

Solution The above cumulative hazard values could now be used in a hazard plot to check against a specific distribution type. Also, the slope of the estimated cumulative hazard curve can provide crude estimates of the hazard rate h(t). EXERCISE 7.11 Estimate the variance of the cumulative hazard estimates in Table 7.23 at each failure time.

233

Analysis of Multicensored Data

TABLE 7.23 Cumulative Hazard Calculation for Exact Failure Times Example Failure Times 200 300 400 500 600

Number on Test

Number of Failures

Hazard Value

Cumulative Hazard Value

8 6 5 3 2

1 1 1 1 1

1/8 = 0.125 1/6 = 0.167 1/5 = 0.200 1/4 = 0.250 1/5 = 0.500

12.5 29.2 49.2 82.5 132.5

7.9  Johnson Estimator In 1964, Johnson published a monograph in which he described a method for the analysis of incomplete tests, that is, data subject to multicensoring. This procedure has an advantage in that median ranks can be easily determined for each failure time based on what Johnson refers to as an adjusted order number for each failure. Consider a simple experiment in which five items are placed on test. There are three failures occurring at times 10, 22, and 45 hours. At times 15 and 37 hours, unfailed units are removed from the test for some reason, perhaps destructive analysis to investigate the progression of a failure mechanism. In any case, these two units are censored observations, which Johnson calls suspensions. Table 7.24 summarizes the results. The first failure F1 at 10 hours is given the failure order number 1. This failure is followed by a suspension S1. The second failure F2 occurs at 22 hours. Its order number would be 2 only if we assume that the suspension S1 at 15 hours would have failed after 22 hours had it not been censored. However, there is also the possibility that this suspension could have failed prior to 22 hours, in which case the order number for the second observed failure F2 would be 3. Consider now the third failure F3 at 45 hours. If both suspensions S1 and S2 are assumed as possibly failing before 45 hours, the order number for F3 would be 5. However, if only the first suspension S1 is assumed to fail at a time greater than 45 hours, then the third failure F3 would have an order time of 4. If both suspensions are assumed to fail after 45 hours, then the third failure F3 has an order number of 3. We have to consider how many different scenarios can occur for the n = 5 items. Labeling the failure of the first suspension S1 as A and the failure of the second suspension as B, we obtain the eight possible outcomes shown in Table 7.25. Consider the order number of F2 in Table 7.25. In the first outcome, its order number is 3. In the third outcome, the order number is 3. For F2, the eight possible order numbers are 3, 3, 2, 2, 2, 2, 2, and 2, leading to an average of 18/8 = 2.25. The eight possible order numbers for the third failure F3 are 5, 4, 5, 4, 4, 5, 3, and 3, with an average value 33/8 = 4.125. Hence, the three failures are assigned the mean order numbers, and by use of the spreadsheet inverse beta function = BETAINV(0.5, r, n−r+1), where r is the mean order number, median ranks can be determined as shown in Table 7.26. Similarly, pointwise 100(1 − α)% confidence intervals can be estimated using the beta inverse function with the first argument set to α/2 or 1 − α/2. Obviously, the Johnson method requires considerable enumeration of possible scenarios if done manually. However, Johnson derives a simple formula to estimate the mean order

234

Applied Reliability

TABLE 7.24 Multicensored Results Item Number i

Time (hours)

Event

10 15 22 37 45

F1 S1 F2 S2 F3

1 2 3 4 5 F = Failure S = Suspension

TABLE 7.25 Possible Outcomes for n = 5, FSFSF, Assuming Possible Eventual Suspension Failures 1

2

3

4

5

6

7

8

F1 A F2 B F3

F1 A F2 F3 B

F1 F2 A B F3

F1 F2 A F3 B

F1 F2 B F3 A

F1 F2 B A F3

F1 F2 F3 A B

F1 F2 F3 B A

TABLE 7.26 Mean Order Numbers and Median Ranks Failure Time 10 22 45

Mean Order Number

Median Rank

1 2.25 4.125

0.129 0.360 0.709

number. At each failure following one or more suspensions, an increment to the previous order number is given by the formula for n items

New increment =

(n + 1) − ( previous mean order number ) 1 + (number of failed or suspended items beyond current suspension set)

If we designate the mean order number for the failure at the ith item number by Oi, then we can write the formula for Oi as

Oi = Oi−1 +

(n + 1) − Oi−1 n−i+2

For example, referring to Table 7.25, we see that F1 has mean order number 1 because no suspensions precede it, F2 has mean order number 1 + [(5 + 1) – 1]/(2 + 2) = 2.25, and F3 has mean order number 2.25 + [(5 + 1) – 2.25]/(0 + 2) = 4.125, in agreement with previous results. This formula can be modified by recognizing that the term n–i+1 represents the reverse

235

Analysis of Multicensored Data

TABLE 7.27 Mean Order Numbers Using Johnson Formula Item Number i 1 2 3 4 5

Event

Reverse Rank

F1 S1 F2 S2 F3

5 4 3 2 1

Mean Order Number Oi 1 2.25 4.125

rank of the observations. That is, for observations ranked 1, 2, …, n, the reverse ranks are n, n – 1, …, 1, respectively. Consequently, the formula can be rewritten as

Oi =

(ith reverse rank )Oi−1 + (n + 1) (n−i+1)Oi−1 + (n + 1) = (n−i+1) + 1 (ith reverse rank ) + 1

The results are shown in Table 7.27. Wasserman (2003) and Wasserman and Reddy (1992) prefer the formula based on inverse ranks, also suggested by Abernethy (1996). The Johnson estimator has, in addition, been advocated by O’Connor (2002) and Kapur and Lamberson (1977). Minitab does make this useful technique available as an option for calculating probability plots. EXERCISE 7.12 Compare Johnson estimate to the median rank estimate for the data in Table 7.1.

Summary This chapter has reviewed Kaplan–Meier PL estimation procedures for the analysis of multicensored data. We have presented examples of censored and truncated data. We have described the estimation of standard errors, pointwise confidence intervals, and simultaneous confidence bands. We have provided illustrations of the use of JMP and Minitab for solving problems involving arbitrary censoring. We have also discussed alternative approaches for the analysis of multicensored data. These methods are very powerful and should be in the toolkit of anyone doing statistical analysis of survival or reliability data.

Appendix 7A 7.1A  Obtaining Bootstrap Confidence Bands Using a Spreadsheet EXAMPLE 7.1A  BOOTSTRAP CONFIDENCE INTERVAL CALCULATION We will use the data from Figure 7.1, consisting of eight units with failures at 200, 300, 400, 500, 600, and 700 hours and two censored observations at 250 and 450 hours. The original data, along with the PL CDF estimate, standard error, and Hall–Wellner terms are shown in Table 7.1A. Type is the indicator for failure (1) or censor (0).

236

Applied Reliability

TABLE 7.1A Original Data, CDF Estimate, Standard Error, and Hall–Wellner Terms Index j 1 2 3 4 5 6 7 8

Time tj

Type δj

FN(t)

seFN(t)

CN(t)

1 – KN(t)

200 250 300 400 450 500 600 700

1 0 1 1 0 1 1 1

0.1250 0.1250 0.2708 0.4167 0.4167 0.6111 0.8056 0.8056

0.1169 0.1169 0.1650 0.1856 0.1856 0.2013 0.1704 0.1704

0.1429 0.1429 0.4095 0.8095 0.8095 2.1429 6.1429 0.0000

0.8750 0.8750 0.7095 0.5526 0.5526 0.3182 0.1400 0.0000

TABLE 7.2A One Bootstrap Run of Data in Table 7.1A Index j 6 3 3 2 6 1 7 6

Time tj*

Type δj*

Index i

mi*

Mj*

FN*(t)

|FN*(t) − FN(t)|

|FN* − FN|− (1 − KN)/(1 − FN)

500 300 300 250 500 200 600 500

1 1 1 0 1 1 1 1

1 2 3 4 5 6 7 8

1 1 2 0 0 3 1 0

8 7 6 4 4 4 1 0

0.1250 0.2500 0.5000 0.5000 0.5000 0.8750 1.0000

0.0000 0.1250 0.2292 0.0833 0.0833 0.2639 0.1944

0.0000 0.1250 0.2230 0.0789 0.0789 0.2159 0.1400

Max 0.2230

Table 7.2A shows the results of one bootstrap run. In the first column, we use the spreadsheet function RANDBETWEEN(1,8) to randomly generate eight index values. Note that some values are repeated. We use these values to refer to the index values in the first column of Table 7.1A and select the time and type corresponding to these randomly selected index values. Thus, in Table 7.2A, the first index is 6, which corresponds to time 500 and type 1(failure). The spreadsheet function that does the matching is OFFSET(reference cell, row number up or down, column number left or right). Thus, if the Time column header in Table 7.1A is cell B2, then OFFSET(B2, 6, 0) will return the value 500 that is six rows below and in the same column as B2 and OFFSET(B3, 6, 0) will return the corresponding index value “1.” The fourth column in Table 7.2A contains the ordered indices. The mi* column counts the number of times the ordered row index appears in the first column using the COUNTIF(range,value) function. The Mj* column starts with N = 8 and decreases with each matching mi*. Then, the bootstrap failure curve FN*(t) follows from the Efron formula previously described in Section 7.7.4. The absolute value of the difference between the original CDF estimate and the bootstrap failure curve forms the next column, which is used in the next-to-last-column. Finally, the last column is the maximum value of the next-to-last column. We now repeat this process 250 times (i.e., we bootstrap) and obtain 250 values of the maximum for each run. The 1 − α percentile of these results generates the cN that we use in the Akritas formula for the confidence bands. To actually generate the 250 values, we use a technique involving data tables described in Chapter 15 by Winston (2007). In Excel 2007, we assign a cell in a spreadsheet to refer to the value in the maximum cell. Then, to the left and below this cell, we fill in 250 rows with the values 1 through 250, as partially shown in Figure 7.1A.

237

Analysis of Multicensored Data

0.2230 1 2 3 4 5 6 7 8 9 10 FIGURE 7.1A Partial table for capturing “max” of bootstrap runs.

FIGURE 7.2A Dialog box for creating a data table.

90th percentile 1 2 3 4 5 6 7 8 9 10

0.2500 0.2500 0.2763 0.2684 0.1974 0.3182 0.1974 0.3158 0.2500 0.3158 0.1419

0.3947

FIGURE 7.3A One-way data table with varying max values.

Next we highlight the assigned cell and the two columns and 250 rows below. Then, we select the Data tab, and under the menu item What-If-Analysis, we choose Data-Table… to obtain the dialog box as shown in Figure 7.2A. We leave the Row input cell empty and click any blank cell in the spreadsheet for the Column input cell. Clicking OK produces the one-way data table partially as shown in Figure 7.3A.

238

Applied Reliability

Times tj

FN(t)

LCL

UCB

200 250 300 400 450 500 600

0.1250 0.1250 0.2708 0.4167 0.4167 0.6111 0.8056

0.0000 0.0000 0.0000 0.0000 0.0000 0.1287 0.2574

0.5197 0.5197 0.6765 0.8333 0.8333 1.0000 1.0000

FIGURE 7.4A Bootstrap 90% confidence bands.

1 0.9 0.8

Failure probability

0.7 0.6 0.5 0.4 0.3 0.2 0.1 0

0

100

200

300 400 Time (hours)

500

600

700

FIGURE 7.5A CDF estimate and 90% bootstrap confidence bands.

Then to find the 90th percentile from the column containing the max values in the data table, we use the spreadsheet function PERCENTILE(data table column, 0.90). This 90th percentile is the cN value we use in the bootstrap confidence band formula. Here, for a bootstrap study of 250 runs, the 90th percentile value is 0.3947, producing the confidence bands as shown in Figure 7.4A and as graphed in Figure 7.5A. Note that the bands are very wide because of the small sample size used. The bands are valid up to the second-to-last failure time, that is, 600 hours. The limits have been truncated to be between 0 and 1. A log or logistic transformation may also be applied instead of the linear formula used, as discussed in Section 7.7.1 in this chapter. EXERCISE 7.1A For the data in Table 7.20, create 90% bootstrap confidence bands using a spreadsheet program.

Analysis of Multicensored Data

239

Problems 7.1 Ten units were placed on stress for 1000 hours. At 250, 500, and 750 hours, 1 randomly chosen surviving unit was removed for physical analysis. There were five failure times observed at 41, 253, 441, 561, and 920 hours. At test end, two units survived. Using the Kaplan–Meier PLE, estimate the CDF at the observed failure times. Estimate the standard error of the CDF estimate using Greenwood’s formula. Provide 90% pointwise confidence intervals at all failure times using the normal approximation. Redo the confidence interval estimates by applying the log and logistic transformations. 7.2 Twenty units were placed on stress. Readouts were done at the end of each week at 168, 332, 504, 672, 840, and 1008 hours. The failure counts observed at the readouts were 0, 4, 5, 3, 2, and 2, respectively. At the end of weeks 2, 3, 4, and 5, one randomly chosen surviving unit was removed for failure analysis. Using the PLE for interval data, estimate the CDF at the observed failure times. Estimate the standard error of the CDF estimate using Greenwood’s formula. Provide 90% pointwise confidence intervals at each readout using the normal approximation. Redo the confidence interval estimates by applying the log and logistic transformations. 7.3 For the data in Table 7.10, suppose a preliminary assessment of the cumulative survival probabilities is desired after the first 5 weeks into the study. Using the data for the first five samples and only 5 weeks of stress, construct a life table  estimation of the failure probabilities. Repeat the analysis, but this time use  the Kaplan–Meier joint risk estimate instead of the actuarial estimate. Compare results of the two estimators. 7.4 Construct 90% Hall–Wellner simultaneous confidence bands for the failure distribution obtained in Problem 7.1. Obtain linear, log, and logistic transformation confidence bands. Plot the results. 7.5 Construct 90% Hall–Wellner simultaneous confidence bands for the failure distribution obtained in Problem 7.2. Obtain linear, log, and logistic transformation confidence bands. Plot the results. 7.6 Obtain 90% bootstrap confidence bands for the failure distribution in Problem 7.1. 7.7  Obtain 90% bootstrap confidence bands for the failure distribution in Problem 7.2. 7.8 Using the data in Problem 7.1, calculate the cumulative hazard estimates at each failure time. Estimate the standard errors of the cumulative hazard estimates. 7.9 Using the data in Problem 7.2, calculate the cumulative hazard estimates at each failure time. Estimate the standard errors of the cumulative hazard estimates. 7.10 Estimate the mean order numbers and median ranks using the Johnson estimator for the data in Problem 7.1.

This page intentionally left blank

8 Physical Acceleration Models If we have enough test data, the methods described in the last five chapters will allow us to fit our choice of a life-distribution model and estimate the unknown parameters, percentiles, percentage failing on warranty, and so on. However, with today’s highly reliable components, we are often unable to obtain enough test data when stresses approximate normal use conditions. Instead, we must resort to forcing components to fail by testing them at conditions much higher than the intended application. By testing in this way, we get failure data that can be fit to life-distribution models with relatively small test sample sizes and practical test times. We pay a price while overcoming our inability to obtain failure rates by testing directly at normal-use conditions (with realistic sample sizes and test times), and it is paid in the form of required additional modeling. How can we predict from the failure rate at high stress what a product is likely to experience at much lower stresses? The models used to bridge the stress gap are known as acceleration models. This chapter develops a general theory of these models and looks in detail at what acceleration testing with the exponential, Weibull, or lognormal model means. Several well-known and useful forms of acceleration models, such as the Arrhenius and the Eyring, are described, starting in Section 8.7. Practical use of these models to calculate acceleration factors and failure rates at use conditions is discussed and illustrated with examples. These examples, and all real-life analyses of acceleration test data, start by assuming that various models are (approximately) valid and then use the test data to estimate unknown life-distribution and acceleration model parameters. Therefore, it is important to statistically test whether the data are consistent with the chosen models, covered in Section 8.4, and it is equally important to quantify uncertainty by calculating confidence bounds for the unknown parameters, discussed in Section 8.5. By the end of this chapter, the reader will be able to set up life test experiments and use either JMP, Minitab, or Excel to analyze the resulting failure data and calculate model parameter estimates that can be used to project failure rates at use conditions. NOT E :   Analysis

of life test data where stress levels are changed during the test, known as “step-stress” testing, is covered in Chapter 9. Chapter 9 also covers acceleration model analysis using parametric degradation data instead of actual failure data. For more in-depth coverage of accelerated testing, the reader can refer to Meeker and Escobar (1998) or Nelson (2004).

8.1  Accelerated Testing Theory The basic concept of acceleration is simple. We hypothesize that a component, operating under the right levels of increased stress, will have exactly the same failure mechanisms as when the component is used at normal stress. The only difference is “things happen faster.” For example, if corrosion failures occur at typical use temperatures and humidities, 241

242

Applied Reliability

then the same type of corrosion happens much quicker in a humidified laboratory oven at elevated temperatures. In other words, we can think of time as “being accelerated,” just as if the process of failing were filmed and then played back at a faster speed. Every step in the sequence of chemical or physical events leading to the failure state occurs exactly as at lower stresses; only the time scale measuring event duration has been changed. When this assumption holds over a range of stress values, we say we have simple acceleration, also known as true acceleration. From the film replay analogy, it is clear that simple acceleration is just a transformation of the time scale. Therefore, if we know the life distribution for units operating at a high laboratory stress, and we know the appropriate timescale transformation to a lower stress condition, we can mathematically calculate the life distribution (and failure rate) at that lower stress. This is the approach we will use. In theory, any well-behaved (order preserving, continuous, etc.) transformation could be a model for simple acceleration. However, in terms of practical applicability, we almost always restrict ourselves to simple constant multipliers of the time scale. When every time of failure and every distribution percentile is multiplied by the same constant acceleration factor to obtain the projected values at another operating stress, we say we have linear acceleration. Under this linear acceleration assumption, we can write

(time to failure at stress S1 ) = AF × (time to failure at stress S2 )

where AF is the constant acceleration factor relating the times to failure at the two stresses. Here, S1 represents a collection of settings of one or more different types of stresses, such as temperature, voltage, humidity, and so on; and S2 is another collection of settings for the same type of stresses. If we use subscripts to denote stress levels and let U stand for typical use levels of the stresses and let S represent a set of higher laboratory stress levels, then, assuming we have linear acceleration, the key equations in Table 8.1 hold, no matter what the underlying life distribution happens to be. In Table 8.1, tU represents a random time to failure at use conditions, while tS is the time at which the same failure would have happened at a set of higher stress conditions. Similarly, FU, f U, and hU are the cumulative distribution function (CDF), the probability density function (PDF), and the failure rate at use conditions, while FS, fS, and hS are the corresponding functions at stress level S. Equation 1 of Table 8.1 denotes the linear acceleration model assumption and all of the other equations follow directly from this equation. For example, FU(t) is the probability of failing by time t at use-stress level U, which is equivalent to failing at time (t/AF) at accelerated stress level S (by Equation 1). This relationship is stated in Equation 2 and the next two equations follow by standard transformation of variables methods. (e.g., see Mendenhall, Schaeffer, and Wackerly 1990.) Table 8.1 gives the mathematical rules for relating CDFs and failure rates from one stress to another. These formulas depend only on the assumption of simple acceleration and TABLE 8.1 General Linear Acceleration Relationships 1. Time to failure 2. Cumulative failure probability (0−t) 3. Density function 4. Instantaneous failure rate

tU = AF × tS FU(t) = FS(t/AF) fU(t) = (1/AF)fS(t/AF) hU(t) = (1/AF)hS(t/AF)

Physical Acceleration Models

243

linear acceleration factors. In the next three sections, we apply these formulas to exponential, Weibull, and lognormal life distributions. EXERCISE 8.1 Use the fact that f (t) = dF(t)/dt and h(t) = f(t)/R(t) to verify Equations 3 and 4 in Table 8.1. EXERCISE 8.2 Derive comparable equations to Equations 2, 3, and 4 from Table 8.1 for the “quadratic” acceleration model given by tU = AFtS2 . What happens to high-stress failure times under 1  hour if this kind of acceleration holds? Find the crossover time where the time to ­failure at use equals the time to failure at stress and shorter stress failure times correspond to even earlier use failure times. Does this model seem physically plausible to you? EXERCISE 8.3 Derive comparable equations to Equations 2, 3, and 4 from Table 8.1 under the assumption of “exponential” acceleration model, as given by tU = AFetS. Note that at use conditions, the probability of failure before time AF is zero. Does this model seem physically plausible to you?

8.2  Exponential Distribution Acceleration We use the exponential distribution with CDF FS (t) = 1 − e− λSt. In other words, times to failure at high laboratory stress S can be modeled by an exponential life distribution with the failure rate parameter λS. Using Equation 2 to derive the CDF at use conditions, we get

FU (t) = FS (t/AF) = 1 − e− λSt/AF = 1 − e− ( λS/AF )t

By substituting λU = λS/AF, we see that the CDF at use conditions remains exponential with the new parameter λS/AF. The linear acceleration model assumption implies that an exponential distribution fit at any one stress condition guarantees an exponential fit at any other stress condition (within the range where simple linear acceleration holds). Moreover, when high-stress failure time is multiplied by an acceleration factor AF to get the corresponding failure time at use-stress level, then the failure rate at use-stress level is reduced by dividing the high-stress failure rate by AF. EXERCISE 8.4 Use the time transformation in Exercise 8.3 and the assumption that the life distribution at stress S is exponential with parameter λS to derive FU(t). Is FU(t) still exponential? Next, derive hU(t). Is hU(t) still constant?

The fact that the failure rate λ of an exponential distribution varies inversely with the acceleration factor sometimes misleads engineers to assume this is always the case with linear acceleration. This belief is not correct. In general, the failure rate changes in a very nonlinear fashion under linear acceleration of the time scale. The simple results of this ­section apply only for exponential distributions.

244

Applied Reliability

EXERCISE 8.5 Suppose that a normal distribution with parameters μS = 500 hours and σS = 60 hours gives an adequate fit to high stress–level test failures for a certain component. In addition, based on past testing, you believe there is a linear acceleration factor of 2000 between stress hours and use-condition equivalent hours. Show that the distribution of failures at use conditions follow a normal distribution but both μU and σU are 2000 times larger than μS and σS. EXAMPLE 8.1  ACCELERATION FACTORS FOR EXPONENTIAL DISTRIBUTION A component, tested at 125°C in a laboratory, has an exponential distribution with a mean time to fail (MTTF) of 4500 hours. Typical use temperature for the component is 32°C. Assuming an acceleration factor of 35 between these two temperatures, what will the use failure rate be and what percent of these components will fail before the end of the expected useful life period of 40,000 hours?

Solution The MTTF is the reciprocal of the constant failure rate and varies directly with the acceleration factor. Therefore, the MTTF at 25°C is 4500 × 35 = 157,500 hours. The use failure rate is 1/157,500 = 0.635%/K (percent per thousand hours). The cumulative percent of failures at 40,000 hours is given by 1 − e−0.00635×40 = 22.4%. EXERCISE 8.6 In Example 8.1, suppose the reliability requirement is that no more than 10% of the components fail by 40,000 hours at use conditions. To achieve this reliability, it is proposed to redesign the box environment so that the component-operating temperature is significantly lower. What acceleration factor is needed from this lower operating temperature to the 125°C laboratory stress?

8.3  Acceleration Factors for the Weibull Distribution We assume that FS(t) has a Weibull distribution with characteristic life αS and shape parameter βS. The equation for the CDF is

βS

FS (t) = 1 − e−(t αS )

and, transforming to use stress, we have

FU (t) = FS (t/AF) = 1 − e−[(t/AF )/αS ]βS = 1 − e−[t/( AF × αS )]βS = 1 − e− (t/αU )βU

where αU = AF × αS and βU = βS = β. This result shows that if the life distribution at one stress is Weibull, the life distribution at any other stress is also Weibull (assuming simple linear acceleration). The shape parameter remains the same while the stress characteristic life parameter is multiplied by the acceleration factor. The equal-shape result, βU = βS, is a necessary mathematical consequence of simple linear acceleration. It is often mistakenly added as an additional assumption for a linear acceleration model and a Weibull life distribution. If different stress levels yield data with

245

Physical Acceleration Models

very different shape parameters, then either we do not have simple linear acceleration or the Weibull is the wrong distribution for the data. In Chapter 6, when we discussed Weibull probability plots, we saw that the shape parameter is estimated from the slope of the line fitted to the data, when plotted on Weibull graph paper. Therefore, when we have Weibull acceleration and we plot data from several accelerated stress levels on the same Weibull graph, their fitted lines should be approximately parallel. The lines will not be exactly parallel, of course, since we are dealing with random sample data. However, lines that are very far from parallel would indicate either model or data problems. By calculating the Weibull failure rate at both high-stress and use-stress levels, it is easy to see how h(t) varies due to acceleration. For high stress S, hS(t) = (β/αS)(t/αS)β−1. By writing the characteristic life parameter for use stress U as αU = AF × αS, we obtain



1 β hU (t) = AF × α S AF × α S

β−1

1 β t = AFβ α S α S

β−1

=

hS (t) AFβ

This is a linear change in the failure rate, but the multiple is 1/AF only when β = 1 and the distribution is exponential; otherwise, the failure rate hS(t) is multiplied by 1/(AF)β. EXERCISE 8.7 An engineer is able to achieve an additional 100× acceleration factor for a key component by reducing a box’s operating temperature. He claims that this will reduce the component’s failure rate by at least 100×. Assuming the component closely follows a Weibull life distribution, is his claim always correct? If you disagree, can you give a counterexample? EXAMPLE 8.2  WEIBULL ANALYSIS OF HIGH-STRESS FAILURE DATA Random samples of a manufacturer’s capacitors were tested to estimate how temperature accelerates failure times. Three temperatures were used: one cell each at 85°C, 105°C, and 125°C. Each cell, or controlled oven, contained 40 capacitors operating at the oven temperature. New failures were determined at the following readout times (RTs): 24, 72, 168, 300, 500, 750, 1000, 1250, and 1500 hours. All testing was completed in about 10 weeks, with results given in Table 8.2. Assuming a Weibull distribution, plot all three cells of test data on the same Weibull probability plot. Estimate the Weibull parameters α and β for each cell individually and then estimate the α values for each cell for a common β. Visually examine whether a common β is reasonable.

Solution We will show two approaches for the analysis of Table 8.2 data: the first is based on least squares (LS) regression, and the second is based on maximum likelihood estimation (MLE).

Solution Using LS Regression We can use a spreadsheet program and plot all three cells of test data on the same Weibull probability plot using the techniques described in Chapter 6. Plot Y = the log of the readout times at which failures were observed versus X = the transformed CDF estimate ln[ − ln(1 − Fˆ )]. Then, using LS regression, we can calculate the slope and intercept for each cell and then estimate the Weibull parameters α and β. A visual check can be done to determine whether the equal-slope (same β) consequence of simple acceleration looks reasonable.

246

Applied Reliability

TABLE 8.2 Weibull Temperature–Stress Failure Data Readout Time

85°C Cell, New Failures

105°C Cell, New Failures

125°C Cell, New Failures

1 0 0 1 0 3 0 1 2 8/40

2 1 3 2 2 4 5 1 4 24/40

5 10 13 2 3 2 2 1 0 38/40

24 72 168 300 500 750 1000 1250 1500 Failures/sample size

Cell(ºC) 85 85 85 85 85 105 105 105 105 105 105 105 105 105 125 125 125 125 125 125 125 125

Freq

F = CDF Est

X = ln[−ln(1 − F)]

RT

Y = LN(RT)

1 1 3 1 2 2 1 3 2 2 4 5 1 4 5 10 13 2 3 2 2 1

0.025 0.050 0.125 0.150 0.200 0.050 0.075 0.150 0.200 0.250 0.350 0.475 0.500 0.600 0.125 0.375 0.700 0.750 0.825 0.875 0.925 0.950

−3.68 −2.97 −2.01 −1.82 −1.50 −2.97 −2.55 −1.82 −1.50 −1.25 −0.84 −0.44 −0.37 −0.09 −2.013 −0.755 0.186 0.327 0.556 0.732 0.952 1.097

24 300 750 1250 1500 24 72 168 300 500 750 1000 1250 1500 24 72 168 300 500 750 1000 1250

3.18 5.70 6.62 7.13 7.13 3.18 4.28 5.12 5.70 6.21 6.62 6.91 7.13 7.31 3.18 4.28 5.12 5.70 6.21 6.62 6.91 7.13

FIGURE 8.1 Worksheet for the data in Table 8.2.

The spreadsheet program worksheet is shown in Figure 8.1. Note that no median rank corrections to the F(t) estimates are needed because we have readout times and not exact times of failure. Figure 8.2 is the Weibull probability plot of all three cells along with the linear regression lines fit to the probability plot points. The slopes and intercepts are displayed for each line. Since the β estimate is the inverse of the slope and the α estimate is the exponentiation of the intercept, the Weibull parameter estimates are as displayed in Table 8.3.

247

Physical Acceleration Models

Weibull probability plot of Table 8.2 data 8.00

y = 1.7987x + 10.298

y = 1.3934 x + 7.6599

7.00

y = 1.2921x + 5.47

ln(RT)

6.00 85 105

5.00

125 Linear (85)

4.00

Linear (105) Linear (125)

3.00 2.00 −4.00

−3.00

−2.00

−1.00 ln( − ln(1 − F))

1.00

0.00

2.00

FIGURE 8.2 Weibull probability plot in worksheet.

TABLE 8.3 Weibull Least Square Parameter Estimates Weibull LS Estimates Slope β = 1/slope Intercept α = exp(intercept)

85°C

105°C

125°C

1.798691 0.555960 10.29770 29,664.4

1.393413 0.717662 7.65993 2,121.6

1.292097 0.773935 5.46997 237.5

Indicator Variables for Common Slope under LS Regression Table 8.3 shows the individual parameter estimates for each cell. However, we need to estimate the LS parameters with a common-shape parameter β. To do so involves the use of indicator variables. The technique is described by Neter et al. (1996). For the three cell temperatures, two indicator variables X1 and X2 are defined as follows: X1 =

1 0

if 85° otherwise

X2 =

1 0

if 105° otherwise



The first-order model is then given by



ln(RTi ) = b0 + b1X1 + b2 X 2 + b3 ln[ln(1 − Fˆi )]. This is a multiple regression equation, and the coefficients bi can be determined in a spreadsheet program. The worksheet with indicator variables is shown in Figure 8.3.

248

Applied Reliability

Cell(ºC) 85 85 85 85 85 105 105 105 105 105 105 105 105 105 125 125 125 125 125 125 125 125

Freq

CDF Estimate

X1

X2

ln[−ln(1−F)]

RT

LN(RT)

1 1 3 1 2 2 1 3 2 2 4 5 1 4 5 10 13 2 3 2 2 1

0.025 0.050 0.125 0.150 0.200 0.050 0.075 0.150 0.200 0.250 0.350 0.475 0.500 0.600 0.125 0.375 0.700 0.750 0.825 0.875 0.925 0.950

1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0

−3.68 −2.97 −2.01 −1.82 −1.50 −2.97 −2.55 −1.82 −1.50 −1.25 −0.84 −0.44 −0.37 −0.09 −2.01 −0.76 0.19 0.33 0.56 0.73 0.95 1.10

24 300 750 1250 1500 24 72 168 300 500 750 1000 1250 1500 24 72 168 300 500 750 1000 1250

3.18 5.70 6.62 7.13 7.31 3.18 4.28 5.12 5.70 6.21 6.62 6.91 7.13 7.31 3.18 4.28 5.12 5.70 6.21 6.62 6.91 7.13

FIGURE 8.3 Worksheet using indicator variables.

We now use a multiple regression program to solve for the coefficients in the first-order model. We can use the spreadsheet program LINEST or the regression feature of the Data Analysis Tools add-in provided with Excel. We have reproduced a portion of the latter’s summary output in Figure 8.4. The common slope β estimate is the reciprocal of the coefficient of ln(−ln(1 − F)). Thus, the  β  ­estimate is 0.702860. To find the αs for each temperature cell, we note that at time t =  α, ln(−ln(1  −  F))  = 0. Thus, for 125°C, since X1 = X2 = 0, the first-order model shows that α = exp(b0) = 233.30. For 105°C, since X1 = 0 and X2 = 1, we get α = exp(b0 + b2) = 2,204.96. For 125°C, since X1 = 1 and X2 = 0, we get α = exp(b0 + b1) = 12,054.73. Formal statistical tests for the equality of LS slopes can be performed (Kleinbaum et al. 1998). For now, we will depend on what appears to be a reasonable visual confirmation of equal slopes. However, we will describe a straighforward test for parallel slopes using MLE methods in Sections 8.5 and 8.6. We have solved this example using a spreadsheet program to illustrate several important concepts. However, the solution can be easily obtained from Minitab simply by selecting the LS estimates in the “Parametric Distribution Analysis (Arbitrary Censoring)” part of the Reliability/ Survival platform and checking the box for a common slope. The Minitab graphical output is shown in Figure 8.5 (the Minitab worksheet is shown in Figure 8.7).

Solution Using MLE Methods We will now find different Weibull parameter MLEs for each stress cell using Minitab and JMP. In addition, we can calculate the MLEs assuming the shape parameter β is the same for all three cells. Then, we record the minimized negative log likelihoods of the data under both the separate

249

Physical Acceleration Models

Coefficients Intercept X1 X2 ln(−ln(1 − F)

5.452319894 3.944892598 2.246146363 1.422757986

b0 b1 b2 b3

FIGURE 8.4 Portion of data analysis summary output.

Probability plot for interval start Weibull Arbitrary censoring–LSXY estimates

Percentage

99

Temperature 85 105 125

90 80 70 60 50 40 30 20

Table of statistics Scale Corr Shape 0.702860 12054.7 0.959 0.702860 2205.0 0.989 0.702860 233.3 0.975

10 5 3 2 1

0.1

1.0

10.0

100.0 1000.0 Interval start

10000.0

100000.0

FIGURE 8.5 Minitab output for LS analysis of the data in Table 8.2.

cell distributions and the constant-slope assumptions. These will be used in the next section to test the hypothesis of “simple Weibull acceleration.” Finally, we use the ratio of the α parameter estimates (under the equal-shape, linear-acceleration assumption) to compute the acceleration factors between 85°C and 105°C; 85°C and 125°C; and 105°C and 125°C. The Weibull plot in Figure 8.6 was made with Minitab 15. Start by setting up a Minitab worksheet for the data in Table 8.2 (Ex8.2.MTW, available from the “Downloads & Updates” tab on the web page for this book at http://www.crcpress.com/product/isbn/9781584884668.). The format for interval data was illustrated in Chapter 4, Figure 4.12. Here, the three temperatures are entered one on top of another as is shown in the partial worksheet in Figure 8.7. A “Temperature” column is added to identify the temperature corresponding to each row of data. Follow all the steps described in Chapter 4 for using the Reliability/Survival Platform “Distribution Analysis (Arbitrary Censoring)” with the “Parametric Distribution Analysis…” option. In addition, select the Temperature column as a By Variable and choose the MLE Method under the Estimate… button. Also, use the Graphs… button to deselect the confidence limits option—this will avoid cluttering the plot.

250

Applied Reliability

Probability plot for interval start Weibull Arbitrary censoring–ML estimates

Percentage

99

Temperature 85 105 125

90 80 70 60 50 40 30 20

Table of statistics Scale AD* Shape 0.722934 12270.4 84.040 0.722934 1849.3 36.661 0.722934 229.0 0.892

10 5 3 2 1

0.1

1.0

10.0

100.0 1000.0 Interval start

10000.0

100000.0

FIGURE 8.6 Minitab Weibull plot of the data in Table 8.2.



C1 Interval Start

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

* 24 72 168 300 500 750 1000 1250 1500 * 24 72 168 300 500 750 1000 1250 1500

C2 Interval End 24 72 168 300 500 750 1000 1250 1500 * 24 72 168 300 500 750 1000 1250 1500 *

FIGURE 8.7 Partial Minitab worksheet for the data in Table 8.2.

C3

C4

Frequency

Temperature

1 0 0 1 0 3 0 1 2 32 2 1 3 2 2 4 5 1 4 16

85 85 85 85 85 85 85 85 85 85 105 105 105 105 105 105 105 105 105 105

251

Physical Acceleration Models

Note how the lines in Figure 8.6 are nearly parallel. This is a good sign for the existence of “simple acceleration.” From the Minitab output, we obtain the MLEs for the three temperatures as follows: 85°C: βˆ = 0.81, αˆ = 9775, minimum −log likelihood = 37.170 105°C: βˆ = 0.82, αˆ = 1746, minimum −log likelihood = 79.429 125°C: βˆ = 0.68, αˆ = 221, minimum −log likelihood = 80.189 The sum of the minimized −log likelihoods for all the data is 196.788 To have Minitab redo the analysis when all temperatures have the same shape value, check the assume common shape option available from the Estimate menu screen. As shown in Figure 8.8, the common βˆ is 0.723 and the αˆ estimates are 12,270; 1849; and 229 for 85°C, 105°C, and 125°C cells, respectively. From the Minitab output, the minimum negative log likelihood, summed for all three cells, is 197.160. JMP release 6 (or later) also provides a Weibull probability plot similar to that in Figure 8.2. The JMP analysis begins by stacking the three temperature cells on top of one another, together with a temperature column identifying the stress level (Figure 8.9). Note that the interval start and stop points and the temperature column are continuous variables, whereas the frequency column is numeric-ordinal. Fill out the JMP Reliability/Survival menu as shown in Figure 8.10, click OK, and then use the red button to add Weibull Plot and Weibull Fit to the JMP output screen. The JMP plot and MLEs are shown in Figure 8.11. Casting Temperature as a “Grouping” rather than a “By” variable results in a combined probability plot and table with all three temperatures shown. Doing an analysis with a constant β across the three cells requires some additional inputs with JMP. In JMP 7, the “Fit Parametric Survival” screen has been designed with great generality—but this comes at a cost. The user has to understand the models that JMP is fitting and the method of working with these models in order to answer reliability questions. We describe a way of inputting the Table 8.2 data into JMP using indicator variables, which results in direct Weibull parameter

Probability plot for interval start Weibull Arbitrary censoring–ML estimates

Percentage

99

Temperature 85 105 125

90 80 70 60 50 40 30 20

Table of statistics Scale AD* Shape 0.722934 12270.4 84.040 0.722934 1849.3 36.661 0.722934 229.0 0.892

10 5 3 2 1 0.1

1.0

10.0

100.0 1000.0 Interval start

FIGURE 8.8 Minitab Weibull plot of the data in Table 8.2: equal slopes.

10000.0

100000.0

252

FIGURE 8.9 JMP-7 data table for analysis of the data in Table 8.2.

FIGURE 8.10 JMP-7 reliability/survival screen inputs.

Applied Reliability

Physical Acceleration Models

253

FIGURE 8.11 JMP–Weibull plot and MLEs for the data cells of Table 8.2.

estimates of ln α and α for the different temperatures. We also show an alternative input method in Appendix 8.1A that involves a smaller input-data table; however, it requires some additional calculations. Starting with the JMP worksheet shown in Figure 8.9 (Ex8.2.jmp), add three more columns to the right. Label the first column “ln α 85” and put the number one in every row corresponding to an 85°C cell and a zero in the rows corresponding to data from the other cells. Label the second column “ln α 105” and put the number one in every row corresponding to the 105°C cell and a zero in the rows corresponding to data from the other cells. Finally, label the third column “ln α 125” and put the number one in every row corresponding to the 125°C cell and a zero in the rows corresponding to data from the other cells. These new columns (numeric-ordinal in JMP) are called indicator variables. The data table is shown in Figure 8.12. Now run the Fit Parametric Survival platform, as shown in Figure 8.13, with two changes: Instead of adding Temp in the Construct Model Effects box, add the three new columns as effects. In addition, check the No Intercept box at the bottom of the platform screen. (Note: When you run the analysis, some releases of JMP may give you one or more error messages. Ignore them, and select OK until you get the analysis output.) The output, as shown in Figure 8.9, contains the ln α estimates and the α estimates under the “Alternate Parametrization” for the

254

Applied Reliability

FIGURE 8.12 JMP data table for common-slope analysis of the data in Table 8.2 using indicator variables.

FIGURE 8.13 JMP fit parametric survival model screen inputs for common-slope analysis of the data in Table 8.2.

Physical Acceleration Models

255

FIGURE 8.14 JMP fit parametric survival model screen outputs for common-slope analysis of the data in Table 8.2.

three temperatures. The α values were obtained by exponentiation of the ln α values, as shown in Figure 8.14. The parameter estimates and negative log likelihood agree with the Minitab results for a common-shape parameter shown in Figure 8.8. The estimate of the acceleration factor between 105°C and 85°C is 12,270/1,849 = 6.64. An estimate for the acceleration between 125°C and 85°C is 53.58, and the acceleration between 125°C and 105°C is 8.07. The MLEs for the separate temperature cells, with or without the same slope constraint, can also be obtained from the Genweibest.xls template. Use of this template was described in the Appendix to Chapter 4. In the Appendix to this chapter, we show how to use Genweibest.xls to analyze the data in Example 8.2. Although Figures 8.7 and 8.11 visually support the equal-slope or “simple acceleration” assumption, we need a quantitative way to test the validity of this assumption. Section 8.4 shows how to use the minimum −log likelihood values to test hypotheses such as “equal slopes.”

An important point should be noted about the experimental design described in Example 8.2. It is common to find reliability experiments set up in a balanced way, with equal numbers of units in each stress cell. Although this would be a standard approach when designing experiments where each unit tested produces a response, it is far from optimal when setting up lifetest reliability experiments. The number of failures observed in a cell is the primary factor for determining the precision of parameter estimates for that cell. As discussed in Section 8.13, it is better to allocate more units to the lower stress cells than to the higher stress cells. EXERCISE 8.8 The MLE of the single β across all three temperature cells in Example 8.2 turned out to be closer to the β estimate of 0.68 from the 125°C cell, rather than the MLEs 0.81 and 0.82 for β from the 85°C and 105°C cells. Does this make sense? Why?

256

Applied Reliability

8.4  Likelihood Ratio Tests of Models In Section 4.5, we showed how to write the likelihood (LIK) function for any set of data. This procedure can be generalized to handle data from several stress conditions by multiplying together the LIK functions from the different conditions to form one overall LIK function. If each condition has its own separate parameters, then the MLEs will be the same as if each condition were analyzed separately. If, however, one or more parameters are assumed to be the same across several conditions, then the resulting MLEs are meaningful only if the hypothesis of equality is true. We will use the maximized value of LIK (or the minimized value of L = −log LIK) as a measure of “how likely that model is.” As an example, assume two vendors submit units of the same kind of component and you have to decide whether they are equivalent, as far as reliability is concerned. After running high-stress life tests on both samples, how would you decide whether the data indicate no significant differences between the vendors or, at the other extreme, that one vendor is significantly better than the other? Clearly, this kind of decision can have major financial impacts on the vendors and on the reliability of the products in which the components are used. An objective statistical test is needed, rather than subjective decisions based on how good each set of data looks on its own. Depending on the assumed distribution and the kind of data obtained, there may be many different statistical procedures that test the hypothesis that both vendors are equal versus the alternative that one vendor is significantly better than the other. One method, known as the likelihood ratio (LR) test, is generally applied. In particular, for censored life test data, or for readout data, it is often the only easy method. The LR test uses the L value (the minimized value of L = −log LIK) assuming that the hypothesis is true (for example, for the two vendors, that the two samples come from the same distribution). Call this likelihood L1. It is computed by pooling the two data sets and fitting a single distribution to the pooled data. Next, separately fit the distribution to each data set, allowing each one to have its own distribution parameters. Call the sum of the minimum −log LIK values for these data sets L2. It is the smallest possible negative log likelihood value we can obtain from the two cells (assuming we do not change the life distribution model assumption) because separate parameters were fitted to each data set. Intuitively, the closer L1 is to L2, the more reasonable the hypothesis of equality is. On the other hand, if L is much smaller when each sample is allowed to have its own parameters, then the hypothesis of equality is much less reasonable. The LR test gives us a way to decide how large a difference is statistically significant. Define the LR test statistic LRT = 2(L1 − L2). Then, assuming the hypothesis of equality is true, LRT will have (approximately) a chi-square distribution. The chi-square degrees of freedom (dfs) equal the difference between the numbers of the parameters estimated for the two models. In the two-vendor case, with a two-parameter Weibull, the number of parameters under L1 is two, whereas the number of parameters under L2 is four. So, LRT has approximately a chi-square distribution with 4 − 2 = 2 dfs. 
If the calculated value of LRT exceeds a high percentile (say, the 95th or 97.5th percentile) of the chi-square distribution with 2 dfs, then there is convincing evidence for rejecting the idea that the vendors are equal. Otherwise, we cannot conclude that the vendors are convincingly different. A more general statement of the LR test is as follows: Suppose we have one or more samples of data. First, we write the LIK functions of all the data with unknown distribution parameters that differ from sample to sample. Using available software, we determine the minimized negative log likelihood with respect to all the unknown parameters. These parameter values are the same as the MLEs that maximize LIK. Call this minimized negative log likelihood L2.

257

Physical Acceleration Models

Next, we make assumptions we are interested in investigating about the distribution parameters across the data cells. These assumptions may be that some parameters have known values, or that one or more parameters are unknown but equal across several cells, or even that there is an equation or model that relates the values of some of these parameters across the cells (and this model may include stress terms and “acceleration model” parameters). Note that this new set of assumptions or model has to be a special case of the model used to obtain L2. We then write the LIK functions of all the data under this new model and again minimize the negative log likelihood with respect to the new set of unknown parameters. Call this second minimized negative log likelihood L1 and the model it is based on model 1. Let LRT = 2(L1 − L2). In addition, let d be the difference between the number of parameters estimated when calculating L1 and the number of parameters estimated when calculating L2. Then, if model 1 is true, LRT has an approximate chi-square distribution with d dfs. If the calculated value of LRT is too large (as compared with χ 2d;.95 or χ 2d;.97.5, or some other [preselected] high percentile of the chi-square distribution with d dfs), then we have significant sample evidence to reject model 1. We can use the LR test to decide whether a set of Weibull data can be modeled reasonably well by an exponential distribution (just test whether β = 1). Similarly, we can test whether several cells of data have the same slope (just test whether all the cells have the same β). We will also use a LR test later in this chapter to test whether a fitted acceleration model is consistent with the observed data. Both JMP and Minitab calculate L1 and L2. All we need to do is calculate LRT and d and look up a chi-square percentile. We can even use an Excel spreadsheet to obtain MLEs and negative log LIKs, as is shown in Appendix 8.2A. EXAMPLE 8.3  WEIBULL LIKELIHOOD EQUAL-SHAPES TEST Use the LR test to decide whether it is reasonable to assume equal shapes for the three different temperature cells of data shown in Example 8.2. (We also show in Appendix 8.2A how this calculation can be carried out using an Excel spreadsheet and Excel Solver.)

Solution In Example 8.2, Minitab provided the negative log LIK for the three independent cells as L2 = 196.788. Assuming a common β, the result is L1 = 197.160. (To obtain the negative log LIK in JMP for the three independent cells, we run the Fit Parametric Survival routine but enter temperature as the “By” variable and not in the “Construct Model Field” area. Then, we add the individual −LogLikelihood values in the “Whole Model Test” output for the three temperatures.) The chi-square statistic is LRT = 2 × (197.160 − 196.788) = 0.744. The dfs are six (the number of parameters estimated assuming three independent cells) − 4 (the number of parameters estimated if the shapes are all the same) = 2. Using the spreadsheet function 1−CHIDIST(0.744,2), we see that 0.744 is the 31st percentile of the chi-square distribution with 2 dfs. Hence, there is no reason to reject the assumption of equal shape values—that is, the data are consistent with a single β value. EXERCISE 8.9 Analyze DATA3 in Problem 1.5 (at the end of Chapter 1) assuming a Weibull fit. Find MLEs for both Weibull parameters. Then set β = 1 (an exponential distribution) and estimate λ, the failure rate. Also use the LR test to test the hypothesis that β = 1. EXERCISE 8.10 Look at DATA3 and DATA4 from Problem 1.5. Use the LR test to decide whether it is reasonable to fit them with Weibull distributions with the same shape parameter. EXERCISE 8.11 Test whether the three sets of data in Exercise 4.10, can be adequately modeled with the same Weibull distribution (as opposed to three separate Weibull distributions).

258

Applied Reliability

8.5  Confidence Intervals Using the LR Method In the last section, we saw that the LR test is a versatile methodology for testing hypotheses concerning model parameters. This test applies to multicensored reliability test and field data, whereas most other statistical methods fall short. Moreover, the numbers needed to perform LR tests are readily calculated by current statistical software packages, such as JMP and Minitab. Even the widely used spreadsheet package Excel, with the help of its built-in Solver tool, can calculate all the negative log LIKs needed for an LR test (see the appendix to this chapter for details about spreadsheet calculations and downloadable Excel spreadsheets). However, there are other ways to use this powerful tool. Chapters 4 and 5 showed how to construct confidence intervals called profile likelihood intervals. This section explains how these intervals are an extension of the LR test. Tests of hypotheses and confidence limits have a natural relationship. If we look at all parameter values that would be accepted for a test at the 95% confidence level (i.e., not lead to a rejection by observing a statistic with less than a .05 probability of occurring if the hypothesis is true), the smallest and largest of these parameter values are the limits of a 95% confidence interval for the parameter. So, we can use this test to construct confidence intervals by determining the upper and lower parameter values where T = 2(L1 − L2) is less than χ12;.95 = 3.8415 or, equivalently, L1 is no larger than L2 + 1.92073 (for a 95% confidence interval). Here, L2 is the minimum value of −log LIK obtained when all parameters are set at their respective MLE values. These confidence limits are described in detail in this ­section, using the lognormal distribution as an example. Assume we have used the MLE method to get the maximum LIK estimates of ˆ = lnT50 and σˆ using a lognormal life distribution. We want the upper and lower confidence limits for both these parameters, each of which encloses the true value (approximately) 100 × (1−α)% of the time. If the minimized −log LIK is L2 and we look for all combinations of (σ, μ) that result in a −log LIK less than L2 + 1.92073, we might come up with the ellipsoid region shown in Figure 8.15. The extreme parameter values in this region (as shown in Figure 8.15) are the 95% confidence limits for each parameter.

[Figure 8.15 sketches the joint 95% likelihood region in the (sigma, mu) plane: the set of parameter combinations whose −log LIK is within 1.92073 of the minimum, with the MLE solution for mu and sigma at the center and the mu and sigma upper and lower limits at the extremes of the region.]

FIGURE 8.15 Region for finding lognormal parameter confidence limits.


This approach is used in the Survival Parametric Platform and the Nonlinear Model Fit platform in JMP (release 5 or later) software to construct confidence limits. Intervals obtained in this way are known as likelihood confidence intervals or profile likelihood confidence intervals. The JMP default (as of release 6) is 95%. However, we can use a spreadsheet to obtain 80%, 90%, or 99% intervals (instead of 95%) by just changing the value 1.92073 to 0.82119, 1.35277, or 3.31745, respectively (see Table 4.1A in Appendix 4.3A). Even though it seems quite complicated mathematically to calculate the ellipsoid region in the parameter space to get the confidence limits, it is easy to accomplish using Excel’s built-in Solver and Goal-Seek routines. This method was shown in Appendix 4.3A. We can use the same technique to find a confidence interval for the common β in Example 8.2. The spreadsheet solution is found in Appendix 8.3A*. EXAMPLE 8.4  CONFIDENCE-BOUND CALCULATION FOR A COMMON WEIBULL SLOPE Use JMP to find an approximately 95% confidence interval for the common Weibull slope β for the data in Table 8.2.

Solution
The JMP equal-β fit output for the data in Table 8.2 is in Figure 8.14. JMP can display LR confidence intervals by selecting (clicking) the red options button next to Parametric Survival Fit and the menu item Confidence Intervals. The output is shown in Figure 8.16. Because JMP uses a different parameterization for the Weibull distribution, β = 1/δ. The limits (1.138, 1.715) for δ translate to the 95% confidence interval (0.583, 0.879) for β. We can find a confidence interval for α for each of the three temperatures by exponentiating the upper and lower bounds on the parameter estimates in Figure 8.16. We leave this calculation for the exercise.

EXERCISE 8.12
Using the output in Figure 8.16, estimate the confidence limits for the α parameters for the three temperatures.

FIGURE 8.16 JMP output for Weibull parameter likelihood confidence limits for data in Table 8.2.


8.6  Lognormal Distribution Acceleration Now, we model FS(t) at laboratory stress by the lognormal distribution



$$F_S(t) = \Phi\left(\frac{\ln\left[t/T_{50_S}\right]}{\sigma_S}\right)$$

where Φ is the standard normal CDF defined in Chapter 5, and T50S and σS are the lognormal parameters for the laboratory-stress life distribution. Using the acceleration transformation of the time scale given by FU (t) = FS (t/AF),

we obtain



$$F_U(t) = F_S\left(\frac{t}{AF}\right) = \Phi\left(\frac{\ln\left[\dfrac{t/AF}{T_{50_S}}\right]}{\sigma_S}\right) = \Phi\left(\frac{\ln\left[\dfrac{t}{AF \times T_{50_S}}\right]}{\sigma_S}\right) = \Phi\left(\frac{\ln\left[\dfrac{t}{T_{50_U}}\right]}{\sigma_U}\right).$$

FU(t) is a lognormal distribution with σU = σS = σ, and T50U = AF × T50S, where σU and T50U are the use-stress parameter values, and AF is the acceleration factor that converts a time to failure at laboratory stress to the corresponding time to failure at use stress. This result is similar to that for the Weibull: simple linear acceleration does not change the type of distribution or the shape parameter. Only the scale parameter is multiplied by the acceleration factor between the two stresses. Since the shape or σ is the slope on standard lognormal paper, we once again expect different test conditions to give rise to nearly parallel lines when plotted on the same graph, provided linear acceleration is an adequate model. The ratio of the times taken to reach any chosen percentile, such as T50, gives an estimate of the acceleration factor between two stress levels. As we saw with the Weibull distribution, parallel stress cell lines or equal σs arise as a consequence of simple linear acceleration and a lognormal life-distribution model assumption, without any additional assumptions needed. In addition, the relationship between the failure rates before and after acceleration is complicated and depends on the particular time point under evaluation. Failure rates must be calculated using the basic definition for h(t) given in Chapter 5. EXAMPLE 8.5  LOGNORMAL STRESS-FAILURE DATA A semiconductor circuit module has been observed to fail due to metal ions migrating between conductor lines and eventually causing a short. Both temperature and voltage affect the time taken for such failures to develop. It is decided to model the kinetics of failure by conducting an experiment using six different combinations of temperature and voltage. Table 8.4 gives the experimental design matrix for this experiment. The stress cells are as follows: (125°C, 8 volts);


TABLE 8.4 Experimental Design Matrix

          8 volts   12 volts   16 volts
85°C         —         50         50
105°C        —         50         50
125°C       50         50          —

TABLE 8.5 Lognormal Stress-Failure Data

Readout             125°C/8 V   125°C/12 V   105°C/12 V   105°C/16 V   85°C/12 V   85°C/16 V
24                      0            0            0            0           0           0
100                     1            4            0            0           0           0
150                     6            3            1            1           0           0
250                     6           23            3           10           1           1
500                    23           13           19           21           1           5
750                     9            5           15           10           3          15
1000                    3            1            6            5           5           6
Survivors at 1000       2            1            6            3          40          23
Total                  50           50           50           50          50          50

(125°C, 12 volts); (105°C, 12 volts); (105°C, 16 volts); (85°C, 12 volts); and (85°C, 16 volts). The use stress is (25°C, 4 volts), and sample size is 50 per cell. No units were run at (85°C, 8 volts), (105°C, 8 volts), and (125°C, 16 volts) because of limited resources and a concern that few failures would occur at the two low stress–level combinations. Each cell is read out for new failures at 24, 100, 150, 250, 500, 750, and 1000 hours. The failure data are presented in Table 8.5. Use a lognormal failure distribution and plot the six cells of data. Visually check whether the equal-sigma property of "simple acceleration" looks reasonable. Estimate the parameters for each stress cell using MLE and do an LR test for equal σs across all cells. Find the 95% confidence intervals for each cell's μ = lnT50 parameter and for the common σ.

Solution We use Minitab’s Release 15 software (Reliability/Survival, Distribution Analysis–Arbitrary Censoring–Parametric Distribution Analysis Platform). It easily produces graphs showing all the cells plotted together. It fits models with and without the constant σ constraint and displays MLE lines on the plot. Make sure to choose the Maximum Likelihood option in the Estimation screen. For the constant-slope analysis, also choose the Assume common shape option. The Minitab worksheet for this example is EX8.5.MTW. Figure 8.17 displays the lines when each cell is fitted independently (the “Cell” column is a “By Variable”). Figure 8.18 shows the fit for the same cells after selecting the assume common shape option. Minitab labels μ = lnT50 as the “Loc” parameter and σ as the “Scale” parameter. This figure has the axes transposed from the usual Minitab default so that “Time” is on the vertical axis. The “AD” statistic is a measure of fit of the model; it is useful when comparing which of several distributions fits a set of data best (smaller “AD” values indicate a better fit).


[Figure 8.17 is a lognormal probability plot (arbitrary censoring, ML estimates) of time versus cumulative percent fail, with a separate fitted line for each of the six stress cells. The plot's table of statistics is:

Cell           Loc       Scale      AD*
125°C/8 V      5.84566   0.634621    1.207
105°C/12 V     6.26039   0.545433    2.375
125°C/12 V     5.49655   0.611942    0.904
105°C/16 V     6.01588   0.564495    1.133
85°C/12 V      7.65973   0.880214   67.239
85°C/16 V      6.83012   0.598759   21.745]

FIGURE 8.17 Lognormal plot—cumulative percent failure data from Table 8.5.

[Figure 8.18 is the corresponding lognormal probability plot with the common-shape (common slope) constraint, so all six fitted lines are parallel. The plot's table of statistics is:

Cell           Loc       Scale      AD*
125°C/8 V      5.84545   0.603139    1.292
105°C/12 V     6.26197   0.603139    2.334
125°C/12 V     5.49675   0.603139    0.921
105°C/16 V     6.01589   0.603139    1.081
85°C/12 V      7.37500   0.603139   67.244
85°C/16 V      6.83182   0.603139   21.745]

FIGURE 8.18 Lognormal plot (common slope).


TABLE 8.6 Lognormal Stress Cell Parameter Estimates

        ML Estimates (Separate Sigmas)            ML Estimates (with One Sigma)
CELL    T50 (hours)   SIGMA   −Log Likelihood     T50 (hours)   SIGMA   −Log Likelihood
1          346         0.63       79.660             346         0.60       79.775
2          244         0.61       77.560             244         0.60       77.569
3          523         0.55       74.701             524         0.60       75.058
4          410         0.56       75.181             410         0.60       75.357
5         2121         0.88       37.795            1596         0.60       38.947
6          925         0.60       66.042             927         0.60       66.043

Figure 8.17 shows that all the lines have very similar slopes; only cell 5, with fewer failures than the other cells, has a somewhat different slope estimate. Estimates of T50, σ, and the minimized negative LIK values from the Minitab output are given in Table 8.6. The constrained MLEs, with a common σ equal across all cells, are the appropriate statistical estimates to use under the assumptions of linear acceleration and lognormality. Adding up the negative log LIK columns in Table 8.6, we see that L2 is 410.939 and L1 is 412.749. The chi-square statistic for the LR test is LRT = 2 × (412.749 − 410.939) = 3.62. The dfs are the difference between 12 parameters estimated from the 6 separate cells and only 7 parameters estimated for the same-slope analysis, or 12 − 7 = 5. Since 3.62 is around the 40th percentile for the chi-square distribution with 5 dfs, there is no reason to doubt the simple acceleration model and the common σs. Note that Minitab has a Test option that also does a statistical test of the hypothesis of a common slope. However, the Minitab calculation comes up with a chi-square statistic of 3.26 (with 5 dfs). Minitab uses a common, but different, test statistic. Although the LR test is probably more powerful for censored reliability data, the results from both Minitab and the LR test suggest that a common σ model is reasonable. In the Appendix to this chapter, the same LR test statistics are easily obtained using Excel and Solver. (A short script reproducing the LR computation from Table 8.6 appears after the exercises below.) We next obtain the 95% confidence intervals for the lnT50 parameters and the common σ. Figure 8.19 shows part of a JMP worksheet (Ex8.5.jmp) that will allow the Fit Parametric Survival Platform to calculate lnT50 and σ estimates (and 95% confidence bounds) under the assumption there is a common slope for all six stress cells. Note that six columns of indicator variables have been added to the JMP worksheet, as explained previously. The analysis output (with confidence limits selected) is shown in Figure 8.20.

EXERCISE 8.13
Repeat the analysis of the Table 8.5 data using a Weibull life distribution.

EXERCISE 8.14
Look at the L1 value for the Weibull analysis of the Table 8.5 data (same β for all cells) and compare it to the corresponding L1 value obtained from the lognormal analysis. Since a smaller negative log LIK is evidence of a better fitting distribution, which model, lognormal or Weibull, appears to fit better? (Note: This is a qualitative assessment. If historical evidence or theoretical arguments favor the distribution with a slightly higher negative log LIK, that should be a more important factor when choosing a life distribution.)
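A short sketch of the LR computation just described, assuming Python with SciPy (the per-cell negative log LIK values are copied from Table 8.6):

from scipy.stats import chi2

# Minimized negative log likelihoods for the six cells (Table 8.6)
separate = [79.660, 77.560, 74.701, 75.181, 37.795, 66.042]  # separate sigmas
common   = [79.775, 77.569, 75.058, 75.357, 38.947, 66.043]  # one common sigma

L2, L1 = sum(separate), sum(common)    # 410.939 and 412.749
lrt = 2 * (L1 - L2)                    # about 3.62
df = 12 - 7                            # 12 parameters versus 7

# chi-square percentile of about 0.40: no reason to reject the common-sigma model
print(round(lrt, 2), round(chi2.cdf(lrt, df), 2))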


FIGURE 8.19 JMP analysis for calculating same-slope MLEs.


FIGURE 8.20 JMP analysis results with same slope.

8.7  Acceleration Models In Section 8.3, we saw that knowing the scale parameters (either α or T50) at two different stress levels allows us to calculate the acceleration factor between these stresses. Alternatively, if we already know the acceleration factor between a laboratory high-stress test and the field-use condition, we can use the results of our test-data analysis to predict failure rates at use-stress conditions. Indeed, this practice is often done as an ongoing process monitor for reliability on a lot-by-lot basis. However, what can be done if an acceleration factor is not known and data can be obtained in a reasonable amount of time only by testing at high stress? The answer is that we must use the high-stress data and fit them to an appropriate model that allows us to extrapolate to lower stress levels. There are many models in the literature that have been successfully used to model acceleration for various components and their failure mechanisms. These models are generally written in a deterministic form that says that the time to failure is an exact function of the accelerating stresses and possibly some material and process variables. In reality, however, all times to failure are random events that cannot be predicted exactly. Also, we have seen that (linear) acceleration is equivalent to multiplying a distribution-scale parameter. Consequently, we will interpret an acceleration model as an equation for either (1) a distribution-scale parameter or (2) a given percentile, written as a function of the levels of the accelerating stress or stresses.


For example, suppose a failure mechanism depends on the levels of two stresses and that the associated failure times are modeled by a lognormal distribution. Then, an equation T50 = G(S1,S2) that gives the T50 as a function of these two stresses is an acceleration model. In Sections 8.8, 8.10, and 8.11 we shall see several common and useful forms for the function G. Before proceeding to these models, however, one point should be stated. Just as, in general, different failure mechanisms are modeled with different life distributions, they can also have different acceleration models. In Chapter 10, we discuss the competing risk model. Using it, we can study each failure mode or mechanism separately and derive the total component failure rate as a sum of the individual failure rates from each mechanism. This bottoms-up method is virtually the only sound way to do acceleration modeling successfully (although a top-down “black-box” approach is also discussed in Chapter 10). For example, one failure mechanism might involve a chemical reaction and be accelerated by temperature. A component with this failure mode might also fail due to metal migration, which could be highly dependent on voltage and current density, in addition to temperature. At the same time there could also be a mechanical wear-out failure mode dependent on the frequency of on–off cycles. Each failure mode requires a completely different acceleration model and must be studied separately. Therefore, when we discuss acceleration models and the analysis of accelerated failure data, we are presupposing that the experiments have been carefully designed to produce data from only one failure mechanism, or any other types of failures have been “censored” out of the data analysis. The way we censor out failures due to a different mode from the mode of interest is to pretend those failure times were times when the units were removed from the test without having failed. The time of removal, that is, the censoring time, is the time that the unit actually failed due to some other failure mechanism; they are “right censored” at that time because they would have failed at some later time due to the failure mechanism under analysis. This assumes that the various failure mechanisms operate independently of one another, a simplifying assumption that generally works well in practice.

8.8  Arrhenius Model For tests where only the temperature is varied, an empirical model, known as the Arrhenius model, has been widely used with great success. This model takes the form

$$T_{50} = A e^{\Delta H/kT}$$

where

A and ΔH are unknown parameters estimated from the data
k = Boltzmann's constant = 8.617 × 10^−5 in eV/K or 1.380 × 10^−16 in ergs/K (1/k = 11,605 K/eV)
T = temperature in degrees Kelvin at the location on the component where the failure process occurs. Degrees Kelvin equals the Celsius temperature + 273.15


For example, while testing integrated circuits, it is common practice to calculate the internal (called the “junction”) temperature of the device. It is the ambient (use or oven stress) temperature plus the incremental self-heating associated with the power dissipation of the device (see Tummala and Rymaszewski 1989 for a discussion of thermal considerations in microelectronics packaging). Note that we can write the Arrhenius model in terms of T50, or in terms of the α parameter (when working with a Weibull), or the 1/λ parameter (when working with an exponential), or any other percentile of whatever life distribution we use. The value of the constant A is often unknown, but as we will see, this detail has no effect on the acceleration factors. For convenience, we use α or T50 in this chapter, but everything applies equally well to other percentiles. We obtain the acceleration factor between temperatures T1 and T2 by taking the ratio of the times it takes to reach any specified CDF percentile. In other words, the acceleration factor AF between stresses 1 and 2 is defined to be the ratio of time it takes to reach P% failures at stress 1 divided by the time it takes to reach P% failures at stress 2. The assumption of simple acceleration makes this factor the same for all percentiles. For the Arrhenius model and the 50th percentile,

$$AF_{12} = \frac{T_{50}(\text{at } T_1)}{T_{50}(\text{at } T_2)} = \frac{A e^{\Delta H/kT_1}}{A e^{\Delta H/kT_2}}$$

from which

$$AF_{12} = e^{(\Delta H/k)\left[(1/T_1) - (1/T_2)\right]}$$

and with a spreadsheet, we can calculate AF12 using the equation = EXP(ΔH * 11605 * (T2 − T1)/(T2 * T1))



This result shows that knowing ΔH alone allows us to calculate the acceleration factor between any two temperatures. Conversely, if we know the acceleration factor, we can calculate ΔH as follows:

$$\Delta H = \frac{k \ln\left(T_{50_1}/T_{50_2}\right)}{\dfrac{1}{T_1} - \dfrac{1}{T_2}} = \frac{k \ln\left(AF_{12}\right)}{\dfrac{1}{T_1} - \dfrac{1}{T_2}} = \frac{T_1 T_2 \ln\left(AF_{12}\right)}{11{,}605\,(T_2 - T_1)}.$$

Using spreadsheet formulas, we can calculate ΔH from

= (T1 * T2 * LN(AF12 )) /(11605 * (T2 − T1 ))

This last equation lets us estimate ΔH from two cells of experimental test data consisting of times to failure of units tested at temperatures T1 and T2. All we have to do is estimate a percentile such as T50 in each cell, then take the ratio of the two times as an estimate of AF12 and use this equation to estimate ΔH. This procedure is valid for any life distribution provided we have simple (linear) acceleration.
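Both formulas are easy to code directly. A minimal sketch (assuming Python; the function names are ours, temperatures are entered in °C, ΔH in eV, and 1/k is taken as 11,605 K/eV):

from math import exp, log

def accel_factor(delta_h, t1_c, t2_c):
    """Acceleration factor from temperature t1 (e.g., use) to t2 (e.g., stress)."""
    T1, T2 = t1_c + 273.15, t2_c + 273.15
    return exp(delta_h * 11605 * (T2 - T1) / (T1 * T2))

def delta_h_estimate(time1, time2, t1_c, t2_c):
    """Delta-H from the same percentile (e.g., T50 or alpha) observed at two temperatures."""
    T1, T2 = t1_c + 273.15, t2_c + 273.15
    return T1 * T2 * log(time1 / time2) / (11605 * (T2 - T1))

# With delta-H = 1.0 eV, the factor from 35C to 125C comes out near 4980
# (Example 8.6 below quotes 4976; the small difference is rounding of constants),
# and the alpha estimates of 12,270 hours at 85C and 229 hours at 125C give
# about 1.22 eV, the value found in Example 8.7.
print(round(accel_factor(1.0, 35, 125)), round(delta_h_estimate(12270, 229, 85, 125), 2))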


EXAMPLE 8.6  CALCULATION OF ACCELERATION FACTOR GIVEN ΔH What is the acceleration factor from a use condition temperature of 35°C to a laboratory test temperature of 125°C if ΔH is 1.0?

Solution Using a spreadsheet or a calculator, we get the value of 4976 for the acceleration factor when ΔH is 1.0. EXAMPLE 8.7  ESTIMATING ΔH FROM TWO TEMPERATURE STRESS CELLS OF DATA Use the 85°C and 125°C cells of data from Example 8.2 and their α estimates assuming a common slope (as shown in Figure 8.5) to calculate an estimate of ΔH.

Solution The MLEs for α at 85°C (358.15 K) and 125°C (398.15 K) are 12,270 and 229 hours, respectively (Figure 8.6). The equation for ΔH yields



ΔH = [358.15 × 398.15 × ln(12270/229)]/(40 × 11605) = 1.22 eV

N OT E :  Making the same ΔH calculation using the (generally less accurate) LS probability plot estimates for α (without constraining slopes to be equal) yields a ΔH estimate of 1.48 eV. Going from a ΔH of 1.22 to a ΔH of 1.48 can make an enormous difference in acceleration factors. For example, the acceleration factor from 125°C to 25°C is 151,265 when ΔH is 1.22 and 1,921,392 when ΔH is 1.48—a difference of almost 13× in the acceleration! On the other hand, the constrained LS estimates shown in Figure 8.5 yield a ΔH estimate of 1.21—almost the same as the MLE calculation. EXERCISE 8.15 Use a spreadsheet to make a table of acceleration factors going from 125°C to 25°C for all values of ΔH from 0.01 to 2.00 in steps of 0.01. Verify that every 0.01 increase in ΔH results in a little over a 10% increase in the acceleration factor and a 0.1 increase in ΔH increases the acceleration factor by almost 2.66×. Repeat the exercise for T1 = 55°C and T2 = 155°C to see how much of a difference the two temperatures make.

8.9  Estimating ΔH with More than Two Temperatures In the last example, we used data from 85°C and 125°C in Example 8.2 to estimate ΔH. If we had used the MLEs from 85°C and 105°C, we would have calculated a ΔH estimate of [358.15 × 378.15 × ln(12270/1849)]/(20 × 11605) = 1.10. If we had used 105°C and 125°C, the estimate would be [378.15 × 398.15 × ln(1849/229)]/(20 × 11605) = 1.35. Which of these three estimates is the right (or best) one to use? The answer is none of them. We need a procedure that uses all the cells simultaneously to derive a ΔH estimate based on all the data, life distribution, and Arrhenius model. This can be done with MLE methods or graphical estimates using lines fitted by LS regression. The key equation for either approach is the Arrhenius model written in logarithmic form:

ln α = ln A + ΔH(1/kT).


Looking first at how to do a graphical analysis, we note that this equation is linear in the (independent) variable {1/kT} and in the response or dependent variable (ln α), with slope ΔH and intercept ln A. So, the plotted data points (ln α, 1/kT) from each cell should line up on approximately a straight line. The slope of this line is an estimate of ΔH. Slope (as discussed in Chapter 6) is calculated by taking any two points on the line and calculating the ratio of the difference in heights over the distance between the temperatures on the x-axis (or 1/kT axis). Units must be adjusted because of the logarithmic calculation: since ΔH is defined in terms of natural logarithms, going one decade on the y-axis is equivalent to 2.3 (that is, ln 10) natural-log units. Special graph papers are available (or can be constructed) where the y-axis is logarithmic and the x-axis is calibrated in terms of reciprocal temperature multiplied by Boltzmann's constant. Once the points are plotted, a straight line through them represents a fitted Arrhenius model. The line becomes an instant calculator for estimating the α that corresponds to any temperature, by going up from the temperature to the line, and then to the left to the α value on the y-axis. The line should be obtained using LS regression on the log equation to avoid the subjectivity of visual fits. By looking at how well the points follow this straight line, we visually assess the Arrhenius model. We can use JMP, Minitab, or Excel to estimate the Arrhenius parameters and the life-distribution shape parameters using either LS regression or MLE methods. Both types of analyses will be illustrated in Example 8.8. However, to use the graphical LS approach, we must first describe how LS regression can provide parameter estimates under the requirement of a common shape parameter β.

EXAMPLE 8.8  ARRHENIUS MODEL ANALYSIS USING BOTH REGRESSION AND MLE METHODS
Use the three data sets of Example 8.2 to obtain a graphical fit of the Arrhenius model and a regression estimate of ΔH. Then, repeat the analysis using the MLE method, replacing the parameter ln α by the Arrhenius expression a + b/kT (with a = ln A and b = ΔH). Do an LR test of the Arrhenius model (with equal shapes) versus three independent cells of Weibull data (no model or equal-shape assumption). Is the Arrhenius model a reasonable assumption, based on the LR test results?

Solution
The three sets of failure data were plotted in Figure 8.5 along with LS estimates. The regression estimates of α for the 85°C, 105°C, and 125°C cells, obtained under the assumption of a common slope β by regressing the transformed CDF estimate ln[−ln(1 − F̂)] on ln(readout times) as explained in Example 8.2, were 12,055, 2,205, and 233 hours, respectively. For the Arrhenius equation, the independent variable 1/kT values are 32.40, 30.69, and 29.15, corresponding to the temperatures 85°C, 105°C, and 125°C, respectively. The corresponding dependent variable values for ln α are 9.40, 7.70, and 5.45, respectively. The ΔH LS estimate is 1.21 and ln A = −29.6. The actual points fit the line very closely, as shown in Figure 8.21. Figure 8.22 shows part of the spreadsheet and the Accelerated Life-Testing platform entries needed for Minitab to use MLE to fit the Arrhenius–Weibull model to these data. Note that you need a column of temperatures (°C) in the spreadsheet and you input this column as an "accelerating variable" and select the Arrhenius Relationship in the "Accelerated Life Testing" platform. The Minitab output screens are shown in Figure 8.23. The temperature coefficient of 1.267 is the MLE of ΔH, and the (Weibull β) shape estimate across all the temperature cells (assuming the Arrhenius–Weibull model applies) is 0.718. Using JMP, we start with the data table in Figure 8.9. In the Fit Parametric Survival Platform dialog box, we add the Temperature column as a "Model Effect." Then, we select Temperature in the Model Effects box and click on the red triangle to display a menu of choices, from which we


[Figure 8.21 plots ln(α) versus 1/kT for the three cells (α estimates of 12,055, 2,205, and 233 hours); the fitted LS line is y = 1.2077x − 29.617.]

FIGURE 8.21 Arrhenius plot using LS estimates.

FIGURE 8.22 Minitab inputs for Arrhenius–Weibull fit.

choose Arrhenius. In the Model Effects box, Temperature is changed to Arrhenius(Temperature), as shown in Figure 8.24. Click Run Model to get the JMP MLE fit of the Arrhenius–Weibull model. The output is shown in Figure 8.25. Note that the estimates of ΔH (under Parameter Estimates, Arrhenius(Temperature)) and β (under Alternate Parameterization, β = 1/δ) are the same as those given by Minitab. Confidence intervals will differ because Minitab (Release 14 or 15) does not use the (usually more accurate) profile likelihood intervals used by JMP. The MLE estimates from all three (Minitab, JMP, and Excel) programs, although somewhat different, agree fairly closely. That is what one would expect from the use of different nonlinear equation solution algorithms and possibly different approximations for Boltzmann’s constant. After we fit an Arrhenius model, we should next test the hypothesis that that model (and the equal-shape parameters in each cell) is appropriate for the data. The L2 value (where parameters are estimated in each cell separately) was 196.788 (see text following Figure 8.8). The L1 value (assuming


Estimation Method: Maximum Likelihood
Distribution: Weibull
Relationship with accelerating variable(s): Arrhenius

Regression Table

Predictor      Coef        Standard Error   Z       P       95.0% Normal CI Lower   Upper
Intercept      −31.4753    5.30248          −5.94   0.000   −41.8680                −21.0827
Temperature      1.26749   0.176521          7.18   0.000     0.921513                1.61346
Shape            0.718294  0.0747038                           0.585836                0.880702

Log-Likelihood = −197.298.

FIGURE 8.23 Minitab Arrhenius–Weibull analysis output.

the Arrhenius model and equal shapes) is 197.298 (see Figures 8.23 and 8.25). The chi-square test statistic for the (null) hypothesis that the model applies is therefore 2 × (197.298 − 196.788) = 1.02 (with 6 − 3 = 3 dfs). Assuming the model is valid, a test statistic this large will occur about 80% of the time—giving us no reason to reject the Arrhenius fit. In other words, the maximum likelihood is nearly the same with only three parameters as with fitting each cell separately using six parameters.
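As a cross-check on the graphical part of this solution, the LS estimates ΔH ≈ 1.21 and ln A ≈ −29.6 (the line drawn in Figure 8.21) can be reproduced with a short script, assuming Python with NumPy; the temperatures and α estimates are the ones listed earlier in the solution:

import numpy as np

# (1/kT, ln alpha-hat) points for the 85C, 105C, and 125C cells
x = 11605.0 / np.array([358.15, 378.15, 398.15])
y = np.log(np.array([12055.0, 2205.0, 233.0]))

slope, intercept = np.polyfit(x, y, 1)        # least squares line
print(round(slope, 3), round(intercept, 2))   # about 1.21 and -29.6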

Now that we have analyzed the data from Example 8.2 and fitted an Arrhenius–Weibull model by several methods (LS estimates for the parameters of both the Weibull distribution and the acceleration model, MLE for Weibull parameters with LSE for the acceleration model, and MLE for both the distribution and acceleration models), a natural question is how well did we actually do? In a real application, we may not know the answer for years, if at all. In this case, however, all the data points were obtained by simulation from known Weibull distributions and a known Arrhenius equation. Thus, we can display the estimates from our analysis and compare them to the “real” values. This comparison is shown


FIGURE 8.24 JMP Arrhenius–Weibull analysis entry screens.

FIGURE 8.25 JMP Arrhenius–Weibull analysis results.


TABLE 8.7 Summary of Arrhenius–Weibull Data Analysis

                      Estimation Method
Parameter      LS Estimates   ML Estimates and LS   MLE–Arrhenius   True Values
β                 0.703            0.723                0.718           0.7
α at 85°C        12,055           12,270               14,659          11,424
α at 105°C        2205             1849                 1670            1451
α at 125°C         233              229                  237             230
ΔH                1.21             1.27                 1.27            1.20
ln A             −29.60           −30.08               −31.50          −29.52

in Table 8.7. The results show that the LS and MLE estimates were comparable in this case, although MLE would be expected generally to give better results. The regression technique used to estimate acceleration model parameters was illustrated for Weibull data and the α parameter in Figure 8.21, but the methodology is really independent of any particular distribution. It can also be applied to the T50, or any other percentile. Since percentiles can be estimated, as described in Chapter 1, without knowing what the distribution is, it is not necessary to assume a form for the failure mechanism life distribution to fit an Arrhenius model. An analysis done without assuming any form of distribution is called a nonparametric analysis. What we gain from assuming a suitable distribution model is accuracy (provided the assumption is approximately valid) and the ability to project results to percentiles that cannot be directly observed because of the small sample sizes that reliability analysts typically work with. This kind of projection is frequently needed; hence, parametric models are the general rule. EXERCISE 8.16 Using whatever methods are available, confirm the estimates of the Weibull distribution parameters and the parameters of the Arrhenius acceleration model in Example 8.8.

8.10  Eyring Model The Arrhenius model is an empirical equation justified by the fact that it “works” in many cases. It lacks, however, a theoretical derivation and the ability to model acceleration when stresses other than temperature are involved. Example 8.5, with both temperature and voltage playing key roles, could not be handled by the Arrhenius model alone. The Eyring model offers a general solution to the problem of including additional stresses. It also has the added strength of having a theoretical derivation based on chemical reaction rate theory and quantum mechanics. In this derivation, based on work by Eyring and described by Eyring, Glasstones, and Laidler (1941), the parameter ΔH has a physical meaning. It represents the amount of energy needed to move an electron to the state where the processes of chemical reaction or diffusion or migration can take place. The Eyring model equation, written for temperature and a second stress variable, takes the following form:

$$T_{50} = A T^{\gamma} e^{\Delta H/kT} e^{\left(B + \frac{C}{T}\right) S_1}$$


The factor $A T^{\gamma} e^{\Delta H/kT}$ is the temperature term. The term $e^{(B + C/T)\,S_1}$ models the effect of stress S1. If a second nonthermal stress is needed in the model, a third term similar in form to the second would be added to the equation. The Eyring model for temperature and two other stress variables, S1 and S2, is

$$T_{50} = A T^{\gamma} e^{\Delta H/kT} e^{\left(B + \frac{C}{T}\right) S_1} e^{\left(D + \frac{E}{T}\right) S_2}.$$

As before, T50 can be replaced by α or any other percentile or even the mean time to failure. It is informative to look at how the temperature term compares to the Arrhenius model. Except for the T γ factor, this term is the same as that of the Arrhenius model. If γ is close to zero or the temperature range over which the model is applied is small, the term T γ has little impact and can be absorbed into the constant A without changing the practical value of the expression. Consequently, it is likely that the Arrhenius model is successful because it is a useful simplification of the theoretically derived Eyring model. When we apply the Eyring model to life-test data with several stress variables, however, we often run into several difficulties. The first of these is the general complexity of the model. The temperature term alone has three parameters to be estimated—A, γ, and ΔH. Each additional stress term adds two more unknown constants, making the model difficult to estimate. As a minimum, we need at least as many separate experimental test conditions as there are unknown constants in the model. Preferably, we like to have several more beyond this minimal number, so that the adequacy of the model fit can be examined. Therefore, a three-stress model should have about 9 or 10 separate test cells. Obviously, designing and conducting an experiment of this size is not simple. Another difficulty is finding the proper functional form with which to express the nonthermal stresses. Temperature is in degrees Kelvin, but how should voltage or humidity be inputted, for example? The theoretical model derivation does not specify; so, the experimenter must either work it out by trial and error or derive an applicable model using arguments from physics and statistics. As an example, consider the temperature and voltage acceleration model given by

$$T_{50} = A e^{\Delta H/kT} V^{B}$$

This expression may not look much like a two-stress Eyring model, but, in fact, it is. Make the substitutions γ = 0, C = 0, and S1 = ln V (where V is in volts), and the two-stress Eyring model reduces to the popular acceleration equation with three unknown parameters to be estimated. The voltage portion of the model follows what is often called the inverse power relationship, whereas the temperature portion follows the Arrhenius model. EXAMPLE 8.9  MLE ANALYSIS OF THE SIX-STRESS CELLS GIVEN IN EXAMPLE 8.5 Continue the analysis of the six-stress cells data given in Example 8.5 and find MLEs for the three unknown parameters in the acceleration model for temperature and voltage given above, and also the common lognormal shape parameter. Using the fitted acceleration model, estimate the probability of failure (i.e., the “population fraction failing”) during the first 40,000 hours at typical use conditions of 25°C and 4 volts. Give a confidence interval for this use CDF estimate. Use the LR test to show that the model fit is acceptable.


Solution As in Example 8.8, begin by replacing the acceleration model equation with a linear form:



ln T50 = a + ΔH/kT + B ln V.

The unknown parameters in the model are a = ln A, ΔH, and the inverse power relationship voltage exponent B (B should be a negative number, justifying the inverse power relationship where lifetime decreases when voltage increases). To set up a JMP analysis, we start with the worksheet shown in Figure 8.26. Figure 8.27 shows the Fit Parametric Survival Platform entries needed for a JMP–MLE analysis. Note that we have

FIGURE 8.26 JMP worksheet for Arrhenius–power relationship–lognormal model.


FIGURE 8.27 JMP inputs for Arrhenius–power relationship–lognormal model.

FIGURE 8.28 JMP results for Arrhenius–power relationship–lognormal model fit.

applied transforms (Arrhenius to the Temp column and log to the Voltage column) in the Construct Model Effects box using the Transform red triangle menu. Figure 8.28 shows the JMP–MLE output. The Parametric Survival Fit options allow an analyst to input a use condition temperature value, a use condition voltage, and a time (40,000 hours), as shown in Figure 8.29. Here, we have added


FIGURE 8.29 JMP dialog to estimate survival probabilities for Arrhenius–power relationship–lognormal model fit.

FIGURE 8.30 JMP survival analysis at possible use conditions.

a second temperature and voltage to show how the output turns out. Clicking on Go gives us the answers shown in Figure 8.30 for the projected F_use(40,000) and F_use(20,000) along with the upper and lower 95% confidence limits (change the "Alpha" 0.0500 value to get confidence intervals of other sizes; for example, use 0.10 for a 90% interval and 0.01 for a 99% interval). Minitab does the same analysis using built-in routines for the Arrhenius model, which allow adding a second voltage–power relationship term. We set up the same input spreadsheet as we did in Figure 8.26. Next, Figure 8.31 shows the input screen for the Accelerated Life Testing Platform, along with a graphical plot, Figure 8.32, of all the cells that includes the maximum likelihood line fits and MLEs. Minitab's model parameter estimates are shown in Figure 8.33. Minitab's projection for R_use(40,000) = 1 − F_use(40,000) at 25°C and 4 volts is displayed in Figure 8.34. To obtain predictions, enter values in the Accelerated Life Testing Estimate box, as shown in Figure 8.35. For a spreadsheet solution to Example 8.9, see Appendix 8.7A.

EXERCISE 8.17
Using whatever analysis programs are available, confirm the results in Example 8.9.


FIGURE 8.31 Minitab inputs for Arrhenius power law analysis.

FIGURE 8.32 Minitab plots for Arrhenius power law analysis.


Censoring Information        Count
Right censored value            75
Interval censored value        225

Estimation Method: Maximum Likelihood
Distribution: Lognormal
Relationship with accelerating variable(s): Arrhenius, Ln

Regression Table

Predictor   Coef        Standard Error   Z       P       95.0% Normal CI Lower   Upper
Intercept   −7.83318    0.941814         −8.32   0.000   −9.67911                −5.98726
Temp         0.546789   0.0384551        14.22   0.000    0.471418                 0.622160
Voltage     −1.06666    0.203036         −5.25   0.000   −1.46460                −0.668719
Scale        0.601847   0.0309698                          0.544108                 0.665714

Log-Likelihood = −414.367.

Probability Plot (Fitted Arrhenius, Fitted Ln) for Start
Anderson-Darling (adjusted) Goodness-of-Fit at each accelerating level

Level (Temp, Voltage)   Fitted Model
85, 12                      67.289
85, 16                      21.681
105, 12                      2.234
105, 16                      1.182
125, 8                       1.115
125, 12                      1.125

FIGURE 8.33 Minitab results of accelerated life-test analysis.

Table of Percentiles

Percent   Temp   Voltage   Percentile   Standard Error   95.0% Normal CI Lower   Upper
50        25     4         157937       78800.1          59400.6                 419930

Table of Survival Probabilities

Time    Temp   Voltage   Probability   95.0% Normal CI Lower   Upper
40000   25     4         0.988751      0.753958                0.999947

FIGURE 8.34 Minitab results of accelerated life-test predictions.

8.11  Other Acceleration Models There are many other successful models, most of which are simple forms of the Eyring model. Several examples are shown next. A model known as the power rule model has been used for paper-impregnated capacitors under voltage acceleration. It has the form AV^−B for the mean time to failure (or the


FIGURE 8.35 Minitab input box for accelerated life-test prediction.

T50, or the α parameter). This relationship is linear on log–log paper. This is similar to the model discussed in Example 8.8, without the temperature term. Another model for the effect of voltage is T50 = Ae^−BV, called an exponential relationship. This relationship is linear on semilog paper. Humidity accelerates many failure mechanisms, such as those related to corrosion or ionic metal migration. The most successful models for humidity acceleration are T50 = A(RH)^−B and T50 = Ae^−B(RH), where RH is relative humidity. A successful model for electromigration failures includes current density J as a key stress variable:

$$T_{50} = A J^{-n} e^{\Delta H/kT}.$$

Electromigration produces open failures in metal-based thin film conductors due to the enhanced movement of ions toward the anode at high temperature and current density. A typical ΔH range is 0.5−0.8 eV, while n = 2 is common. The lognormal life distribution adequately models the failure times, with σ normally in the 0.5−1.5 range. Models for mechanical failure, due to cracks and material fatigue or deformation, often have terms relating to cycles of mechanical stress, thermal cycling, or frequency of use. An example is the (modified) Coffin−Manson model used for solder cracking under repeated temperature cycling (as an electronic component is powered and unpowered). As published by Landzberg and Norris (1969), this model is given by

$$N_f = A f^{\gamma} \left(1/\Delta T\right)^{\beta} G(T_{max})$$

where Nf is the number of cycles to a given percent of failure, f is the cycling frequency, ΔT is the temperature range, and G(Tmax) is a factor that depends on the maximum temperature reached in a cycle.


A trial-and-error approach to model building, using something like a full Eyring and dropping insignificant terms based on fitting to data, should only be tried as a last resort. An experiment based on this approach is likely to be costly and unsuccessful. It is much better to have a model in mind before designing the experiment. This model would come from experience or from the literature on failure mechanisms. The simplest model that fits adequately should be used for as long as it matches experimental data and makes predictions that are not contradicted by experience.

8.12  Acceleration and Burn-In Up to now, we have considered high-stress testing and acceleration modeling as a means to obtain failure data and to predict product life at use conditions. There are two other important reasons why we might run components for a number of hours at high stress. The first reason is to monitor incoming lot reliability, accepting or rejecting lots based on the number of failures obtained from a sample put on high-stress test. In Section 11.15, we show how to use lot acceptance sampling techniques to reject lots that might have an unacceptably high field-failure rate. This use of a high-stress test monitor assumes that we know the acceleration factor between the use condition and the high-stress condition and that we know the life distribution, with only the scale parameter possibly changing from lot to lot. The second reason is a burn-in. Components are run at high stress prior to shipment or use in an assembly. This removes failures that might otherwise occur early in the field life of the product. This reduces the component failure rate throughout the product’s useful life. As opposed to using a sample to monitor reliability, burn-in is a manufacturing step applied to all components to reduce the failure rate to levels below those of non-burned-in components. In Chapter 10, we evaluate the benefits of burn-in using a sophisticated general algorithm for component reliability projection. Here, we look at some simple consequences of acceleration and burn-in. We focus on one failure mode and assume that a Weibull life distribution model fits the data and that the acceleration factor between use and the burn-in high-stress level is “a.” Let hbi(t) be the use failure-rate curve for the population after a burn-in of tbi hours, and let h(t) be the use failure-rate curve for the population without any burn-in. Then, hbi(0) = h(atbi) and the failure rate at time t is given by hbi(t) = h(t + atbi). Note that the failure-rate curve after burn-in is just the continuation of the old pre-burn-in failure-rate curve starting at atbi hours. This is true because the failure rate is, by definition, a “conditional” failure rate for the survivors at time t. The equations for f bi(t) and Fbi(t) are more complicated, requiring conditional probabilities that account for having survived the burn-in period:

$$F_{bi}(t) = \frac{F(t + at_{bi}) - F(at_{bi})}{1 - F(at_{bi})}, \qquad f_{bi}(t) = \frac{f(t + at_{bi})}{1 - F(at_{bi})}$$

What does all this mean when F(t) is a Weibull distribution with shape parameter β? If β < 1, then h(t) is a decreasing function and hbi(t) is always less than the pre-burn-in failure rate. With the proper choice of burn-in stress and time, the post-burn-in failure rate can be reduced to meet any given requirement. If β = 1, then h(t) = λ and hbi(t) = λ. This means that there will be no gain from burn-in at all. Even worse, if β > 1, then hbi(t) > h(t), and the failure rate is higher after burn-in. In Chapter 10, we shall see, however, that burn-in can be effective even for increasing h(t) if there is a defective subpopulation.


EXAMPLE 8.10  CALCULATING NEEDED BURN-IN TIME A component population fits a Weibull life distribution with β = 0.8 and α = 2.582 × 10^8. The average failure rate (AFR) for the first 3 months of field use (estimated to be 720 power-on-hours) is projected to be



$$H(t)/t = (t/\alpha)^{\beta}/t = (720/\alpha)^{\beta}/720 = 50 \text{ FITs}.$$

However, the customer requirement is 20 FITs for the first 90 days. A burn-in is necessary, and the highest practical burn-in temperature for this component is 105°C (with use temperature 25°C, and ΔH = 1.0). How long a burn-in period is needed?

Solution The acceleration factor from 25°C to 105°C for ΔH = 1.0 eV is 3767. The average failure rate after burn-in (in FITs) is given by



$$10^9 \times \frac{H(at_{bi} + t) - H(at_{bi})}{t} = \frac{10^9}{720}\left[\left(\frac{3767\,t_{bi} + 720}{2.582 \times 10^8}\right)^{0.8} - \left(\frac{3767\,t_{bi}}{2.582 \times 10^8}\right)^{0.8}\right].$$

By trial and error, we find that a burn-in time of tbi = 6 hours gives us an AFR of 20.01 FITs, which is the failure rate desired after burn-in. (A short script automating this search appears after the exercise below.)

EXERCISE 8.18
For the component population in Example 8.10, how long a burn-in at 105°C is needed to reduce the first 90-day AFR to about 15 FITs (to the nearest half-hour)? What would the first 90-day AFR be if a 1-hour burn-in is chosen?
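The trial-and-error search in Example 8.10 can be automated with a few lines of code (a sketch assuming Python; the constants are the ones given in the example, and the helper name afr_fits is ours):

# Average failure rate (in FITs) over the first 720 field hours after a burn-in
# of t_bi hours, for a Weibull with beta = 0.8, alpha = 2.582e8, and a burn-in
# acceleration factor of 3767 (Example 8.10).
def afr_fits(t_bi, a=3767.0, alpha=2.582e8, beta=0.8, t_field=720.0):
    H = lambda t: (t / alpha) ** beta              # Weibull cumulative hazard
    return 1e9 * (H(a * t_bi + t_field) - H(a * t_bi)) / t_field

print(round(afr_fits(0.0), 1))   # no burn-in: about 50 FITs
print(round(afr_fits(6.0), 1))   # 6-hour burn-in: about 20 FITs, as in the example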

One important and somewhat puzzling consequence of burn-in should be noted. The original, non-burned-in population in Example 8.10 would show a rapidly decreasing failure rate in the field. A typical field-monitoring program might involve calculating the empirical cumulative hazard function H(t) and plotting it versus time on log–log paper. The slope of a line fitted through the H(t) points is an estimate of β and, with enough field data, we would expect this slope to be close to 0.8. People familiar with Weibull cumulative hazard plotting would take comfort in the knowledge that a slope less than 1 implies a decreasing failure rate; so, the field failure rate will improve with time. But what does the cumulative hazard plot look like for field data from the burned-in population? True, the failure rate starts out much better than it did for the original population. However, the slope of the cumulative hazard plot is approximately one, implying that the failure rate is no longer decreasing. It seems as if the burn-in has converted the population into following an exponential life-distribution model. The population life distribution has, however, not actually changed from a Weibull with shape 0.8 to an exponential. It only appears that way because we are plotting the H(t) starting at atbi versus a time axis starting at time 0. The function

$$H_{bi}(t) = H(t + at_{bi}) - H(at_{bi})$$

is no longer a Weibull cumulative hazard when plotted on log–log paper. Its slope starts out as approximately one and eventually asymptotes to 0.8. In general, for any value of β and equivalent burn-in time of atbi, the slope of the cumulative hazard plot will start near one and gradually approach β. In practice, the slope is likely to remain close to one


throughout the field life of the components. The rapid failure rate decrease takes place during the burn-in, and little visible improvement occurs in the field lifetime.

EXERCISE 8.19
Compute the cumulative hazard function for the burned-in components in Example 8.10 at 200, 400, 600, and 720 hours of field use. Use LS to fit a line through these points plotted on log–log paper. Estimate the slope of the line (i.e., fit the model ln Hbi(t) ≈ A + β ln t using regression to obtain β). You should obtain a slope greater than 0.998. Estimate the slope again after 100k field hours using the cumulative hazard at 100,200; 100,400; 100,600; and 100,720 hour points. Show that it is now nearly 0.88. Finally, show that after 5,000,000 hours of time, the slope falls to less than 0.81.

8.13  Life Test Experimental Design Another important topic related to the accuracy of parameter estimates is the proper design of the experiment. By design, we mean the choice of stress levels and their sample sizes. This is complicated for life testing data, since the number of failures in a stress cell is not known in advance and accurate estimation depends much more on the number of failures than on the sample sizes put on test. The basic design theory, in the cases other than life tests, says that best results are obtained by choosing stress levels as far apart as possible. In the life-test case, this must be tempered by two considerations: (1) stresses too high may introduce new failure modes and violate simple acceleration, and (2) stresses too low may not yield any failures. Consequently, determination of the stress level and sample size often becomes more of an art, based on experience and feel, than an exact science. Some analytical approaches are possible, however. Two of these are as follows:



1. Guess reasonable values for the unknown parameters. Using these values, pick sample sizes and stress levels that will produce an adequate number of failures in each cell. Typically, at least 10 and preferably more than 20 failures in each cell is adequate. With readout data, failures must be spread out over three or more intervals. Your assumed values will generally come from typical literature ΔH's and shape parameters. A scale constant such as the T50 at use is calculated by assuming that the population just meets its use condition failure-rate objective.

2. Use the same unknown parameter value guesses as in the first approach to simulate (possibly censored) cells of failure data at each test condition for a given experimental design and then analyze the data. Iterate this simulation many times to get a feel for the precision of the estimates. Try again with a different design and see what accuracies are obtained. By trying various designs, you will eventually determine a good design (a small simulation sketch follows this list).
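A minimal sketch of the simulation idea in approach 2, assuming Python with NumPy; the guessed parameters, cell temperatures, sample sizes, and readout times below are hypothetical placeholders, not values from the text:

import numpy as np

rng = np.random.default_rng(1)

# Guessed model: Weibull with a common shape and a guessed alpha at each planned temperature
beta = 0.7
alpha_by_cell = {85: 12000.0, 105: 1800.0, 125: 230.0}   # hypothetical guesses
n_per_cell, test_hours = 100, 2000
readouts = [168, 500, 1000, 2000]                         # hypothetical readout times

for temp, alpha in alpha_by_cell.items():
    times = alpha * rng.weibull(beta, n_per_cell)         # simulated failure times
    counts = [int(np.sum((lo < times) & (times <= hi)))
              for lo, hi in zip([0] + readouts[:-1], readouts)]
    survivors = int(np.sum(times > test_hours))
    print(temp, counts, survivors)

# Feed the simulated readout counts into the chosen analysis (JMP, Minitab, or a
# spreadsheet), repeat many times, and compare designs by the spread of the estimates.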

Both methods rely on a good preguess for the true model parameters. Kielpinski and Nelson (1975) have carried this approach even further for the lognormal, and Meeker and Nelson (1975) have studied the Weibull. They have derived optimal life-test schemes assuming exact time-of-failure data and preguessed acceleration model and distribution


shape parameters. An extensive list (159 references) of accelerated test plans is given in Nelson (2005). Example 8.11 shows how a method 1 analysis might be carried out. EXAMPLE 8.11  LIFE TEST EXPERIMENTAL DESIGN We will test a new component to assess whether it meets a use average failure-rate objective of 0.001%/K over 100,000 hours. The failure mode is temperature dependent and modeled by the Arrhenius relationship. The use temperature is 25°C. We propose testing at 65°C, 85°C, and 125°C for 2000 hours. How many units should be tested in each temperature cell, and are our choice of temperatures and test time reasonable?

Solution
Assume that a lognormal distribution is appropriate for the failure mode. Also assume the preguesses for σ and ΔH are 1 and 0.7, respectively. The average failure rate requirement implies that F(100k) is approximately 0.001. This allows us to calculate a use T50 from

$$T_{50} = 100{,}000\, e^{-\sigma \Phi^{-1}(0.001)} = 100{,}000\, e^{-\Phi^{-1}(0.001)} = 2{,}198{,}840 \text{ hours}$$



where Φ^−1 is the inverse of the standard normal distribution and can be looked up in normal tables (or use the spreadsheet function NORMSINV). Now, we use the ΔH value of 0.7 eV to calculate the acceleration factors between 25°C and 65°C or 85°C or 125°C. These turn out to be 25, 96, and 937, respectively. This leads to T50 values of 87,600; 22,900; and 2,347 hours for these temperatures. With such a high T50 at 65°C, it is very unlikely we would see any failures at this temperature unless we put many thousands of units on test. At 85°C, if we go out to 5000 hours, we would expect 6% of the units to fail. So, we modify our design to start at 85°C and place 200 units on test in this cell for 5000 hours. Since we decided against the 65°C cell, we could add a cell at 100°C. The acceleration from 25 to 100°C is 239; so, the T50 is 2,198,840/239 = 9200 hours. This temperature will have 27% failures expected at 5000 hours; so, a sample size of 100 is adequate. Finally, we calculate that at 125°C, an expected failure rate of 44% will occur in 2000 hours. Thus, 100 units on test for only 2000 hours will suffice. Note that if the guessed values were close to the true parameter values, the original test plan design would have yielded too few failures and the parameter estimates would have poor accuracy. An easy calculation ahead of time corrected the situation.
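The design arithmetic in this solution can be reproduced with a short script (a sketch assuming Python with SciPy; the computed use T50 and acceleration factors differ from the values quoted above only by rounding):

from math import exp, log
from scipy.stats import norm

sigma, dH = 1.0, 0.7                               # preguessed lognormal sigma and delta-H (eV)
t50_use = 100000 * exp(-sigma * norm.ppf(0.001))   # about 2.2 million hours

def af(t_use_c, t_stress_c):
    T1, T2 = t_use_c + 273.15, t_stress_c + 273.15
    return exp(dH * 11605 * (T2 - T1) / (T1 * T2))

for temp, hours in [(85, 5000), (100, 5000), (125, 2000)]:
    t50_cell = t50_use / af(25, temp)
    frac_fail = norm.cdf(log(hours / t50_cell) / sigma)
    print(temp, round(t50_cell), round(frac_fail, 2))
# roughly 6% failures at 85C/5000 h, 27% at 100C/5000 h, and 44% at 125C/2000 h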

8.14  Summary Many important concepts and techniques were introduced in this chapter. Some were surveyed briefly, whereas others were developed in detail, by means of examples. These ideas should be studied carefully, for they are typical of situations often encountered in the analysis of life-test data. The theory of acceleration followed from equating “simple” acceleration with a linear change in the time scale. From this, it followed that lognormal or Weibull shapes remained constant between stress levels. Acceleration factors can be calculated by taking the ratios of T50s at two stress levels. Instead of T50, any other percentile, or even the mean time to fail, could be used. Acceleration models describe how the lifetime changes as a function of stress. These models can be set equal to sample estimates of some convenient percentile, to solve for the unknown model parameters and project to a low-stress application.


The following seven steps describe how an acceleration model study might be carried out:









1. Choose as simple an acceleration model for the failure mechanisms under investigation as seems appropriate. Models come from past experience, a literature study, or a theoretical derivation based on the kinetics of failure. Many models have an Arrhenius temperature term and are simplified Eyring models.

2. Choose an appropriate life distribution for the failure mechanism to model the spread of failures over time.

3. Design an experiment consisting of enough different-stress cells to estimate the model parameters. Make sure that the sample sizes and stress levels are such that each cell has an adequate number of failures. More than 10 failures, spread out over at least three readout intervals, is a rule of thumb for adequate data from a cell. An assessment of proper sample sizes and stresses can often be made by guessing reasonable model parameters and working backward from the use failure-rate objective.

4. Analyze the failure data in each cell with as accurate a technique as is available. However, look at a graphical analysis as well, to visually see how well the data fits the life-distribution model and follows the equal-slope consequence of simple acceleration.

5. Fit the acceleration model parameters, again using the best technique available. Test the adequacy of your model using the LR test.

6. Use survival analysis to project use condition CDF values or substitute the estimated model parameters into the model equation, along with use stresses, to project a T50 or α at use conditions. The use shape parameter is the same as the best single value fitting all the stress cells.

7. Calculate the confidence limits for the critical model parameters and other quantities of interest (e.g., percentiles, cumulative failures at end of warranty or at end of life).

Appendix 8A

8.1A  An Alternative JMP Input for Weibull Analysis of High-Stress Failure Data We analyze the Table 8.2 data using the JMP data table shown in Figure 8.9 without adding the indicator variable columns shown in Figure 8.12. Open the Fit Parametric Survival screen and put in the inputs shown in Figure 8.1A. Note again that the Temp column should be a numeric-ordinal format column. Click Run Model and you get the output shown in Figure 8.2A. JMP is fitting the following model:

ln α85 = intercept
ln α105 = intercept + Temp[105 − 85] Estimate
ln α125 = intercept + Temp[105 − 85] Estimate + Temp[125 − 105] Estimate

The MLE calculations for the three α values shown in Figure 8.2A use these model equations along with the assumption of equal slopes for all temperatures.


FIGURE 8.1A JMP inputs for analysis of the data in Table 8.2.

L = −log likelihood = 197.160
α̂85 = e^9.41494271 = 12,270
α̂105 = e^(9.41494271 − 1.8923836) = 1,849
α̂125 = e^(9.41494271 − 1.8923836 − 2.0888863) = 229
β̂ = 0.723

FIGURE 8.2A Analysis results for the data in Table 8.2 (MLEs assuming equal slopes).


8.2A  Using a Spreadsheet for Weibull Analysis of High-Stress Failure Data EXAMPLE 8.1A  WEIBULL LIKELIHOOD EQUAL-SHAPES TEST Example 8.3 used Minitab to carry out an LR test to decide whether it is reasonable to assume equal shapes for the three different temperature cells of data in Example 8.2. Show how this calculation can be carried out using an Excel spreadsheet and Excel Solver.

Solution The Excel spreadsheet in Figure 8.3A has been set up for Solver to calculate separate MLEs for each data set and the corresponding value of L2 (in the spreadsheet cell H2). This spreadsheet is available as Ex8.3A.xlsx from the "Downloads & Updates" tab on the web page for this book at http://www.crcpress.com/product/isbn/9781584884668. You can create your own version of Ex8.3A.xlsx by entering the Table 8.2 data into Genweibest.xls, as described in Chapter 4. Add starting guesses for ln α and β in A2 and B2, respectively, then pull down the formula in G2 to G12. For the second cell of data, the formula in G12 has to be adjusted to refer to new scale and shape parameters (starting values placed in A12 and B12). Then, pull down to row 22. Finally, for the third cell of data, the formula in G22 has to be adjusted again to reference new scale and shape parameters (starting values placed in A22 and B22). Then, pull down to row 31. All the new parameter locations have to be added to Solver (see Figure 8.3A). Run Solver to obtain the MLEs and value of L2 (in H2) shown in Figure 8.3A. The calculation for MLEs of parameters and L1, assuming equal-shape parameters for all three cells, is nearly the same. However, only one starting shape value is needed for input in B2 and this is used in every row for the G-column likelihood formula. The corresponding spreadsheet is in Figure 8.4A. This spreadsheet is available in Ex8.4A.xlsx.
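The same negative log LIK that Solver minimizes can also be coded in a general-purpose language. A minimal sketch for one cell of readout (interval-censored plus right-censored) Weibull data, assuming Python with NumPy and SciPy; the readout times and counts below are hypothetical placeholders rather than the Table 8.2 data, and the function and variable names are ours:

import numpy as np
from scipy.optimize import minimize

# Hypothetical readout data for one stress cell (placeholders, not Table 8.2)
readouts  = np.array([24.0, 100.0, 500.0, 1000.0])   # readout times (hours)
fails     = np.array([2, 5, 8, 4])                   # new failures found at each readout
survivors = 81                                       # units still running at the last readout

def neg_log_lik(params):
    ln_alpha, ln_beta = params
    alpha, beta = np.exp(ln_alpha), np.exp(ln_beta)
    F = lambda t: 1.0 - np.exp(-((t / alpha) ** beta))     # Weibull CDF
    lower = np.concatenate(([0.0], readouts[:-1]))
    ll = np.sum(fails * np.log(F(readouts) - F(lower)))    # interval-censored terms
    ll += survivors * np.log(1.0 - F(readouts[-1]))        # right-censored term
    return -ll

fit = minimize(neg_log_lik, x0=[np.log(2000.0), 0.0], method="Nelder-Mead")
alpha_hat, beta_hat = np.exp(fit.x)
print(alpha_hat, beta_hat, fit.fun)   # MLEs and the minimized -log LIK for this cell

Repeating this for each cell and adding the minimized values gives L2; forcing a single shared ln_beta across the cells in the same way gives L1.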

FIGURE 8.3A Excel spreadsheet for calculating MLEs of individual cells.


FIGURE 8.4A Excel spreadsheet for calculating MLEs assuming a common shape.

NOTE: The template GenWeibcellsest.xls offers an easier way to estimate parameters for up to six cells of Weibull failure data under the assumption of equal shapes—see Section 8.8A. The Solver inputs needed to obtain the MLEs of the parameters and the L1 value are also shown in Figure 8.4A.
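The same calculation can also be sketched outside a spreadsheet. The Python fragment below is our own illustration, not part of the original appendix: it sets up interval-censored Weibull negative log likelihoods for several readout cells, fits each cell separately and then with a common shape, and forms the LR statistic. The readout times and counts are made-up placeholders, so only the structure, not the numbers, matches the Table 8.2 analysis; with the actual Table 8.2 data substituted, L1 and L2 should agree with the spreadsheet values up to optimizer tolerance.

```python
# Sketch: interval-censored Weibull MLEs for several stress cells and the
# equal-shapes likelihood ratio test.  Data below are made-up placeholders.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import chi2

# Each cell: readout times, new failures at each readout, and survivors at end of test.
cells = [
    {"readouts": [100, 200, 500, 1000], "fails": [1, 2, 4, 6],   "survivors": 87},
    {"readouts": [100, 200, 500, 1000], "fails": [2, 5, 9, 12],  "survivors": 72},
    {"readouts": [100, 200, 500, 1000], "fails": [5, 10, 15, 14], "survivors": 56},
]

def weib_cdf(t, alpha, beta):
    return 1.0 - np.exp(-(t / alpha) ** beta)

def cell_negloglik(cell, alpha, beta):
    """Negative log likelihood for grouped (readout) data from one cell."""
    R = np.array(cell["readouts"], float)
    F = weib_cdf(R, alpha, beta)
    probs = np.diff(np.concatenate(([0.0], F)))          # interval probabilities
    nll = -np.sum(np.array(cell["fails"]) * np.log(probs))
    nll -= cell["survivors"] * np.log(1.0 - F[-1])        # censored (unfailed) units
    return nll

# L2: separate (alpha, beta) fitted to every cell
L2 = 0.0
for cell in cells:
    res = minimize(lambda p: cell_negloglik(cell, np.exp(p[0]), np.exp(p[1])),
                   x0=[np.log(500.0), 0.0], method="Nelder-Mead")
    L2 += res.fun

# L1: separate alphas but one common beta
def common_shape_nll(p):
    beta = np.exp(p[-1])
    return sum(cell_negloglik(c, np.exp(a), beta) for c, a in zip(cells, p[:-1]))

res1 = minimize(common_shape_nll, x0=[np.log(500.0)] * len(cells) + [0.0],
                method="Nelder-Mead")
L1 = res1.fun

lr = 2.0 * (L1 - L2)                      # chi-square with (number of cells - 1) df
print("LR statistic:", lr, "p-value:", chi2.sf(lr, df=len(cells) - 1))
```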

8.3A Using a Spreadsheet for MLE Confidence Bounds for Weibull Shape Parameter EXAMPLE 8.2A  CONFIDENCE BOUND CALCULATION FOR A COMMON WEIBULL SLOPE Use an Excel spreadsheet to find an approximate 95% profile likelihood confidence interval for the common Weibull slope β for the data in Table 8.2.

Solution For the spreadsheet calculation, we start with the spreadsheet shown in Figure 8.5A. The reference target value for L1 is 197.159651 + 1.92073 = 199.0804. The calculation steps leading to the upper bound are as follows: Step 1: Open Goal Seek from the Tools menu and put in the inputs shown in Figure 8.11. Press OK and β changes to 0.85604.


FIGURE 8.5A Using Goal Seek confidence bound calculations.

Step 2: Now open Solver and make the inputs as shown in Figure 8.6A. Click Solve to obtain the best values for the αs that correspond to the new β. These turn out to be 8767.47, 1718.73, and 253.92. We have completed the first iteration. Step 3: The second iteration repeats the steps of the first iteration starting with the new spreadsheet obtained at the end of Step 2. Open up Goal Seek and run the routine using the same entries as shown in Figure 8.5A. The value in B2 for β changes to 0.87828. Open up Solver and run it with the same entries as shown in Figure 8.6A. The new values for the αs are 8372.45, 1702.40, and 258.19. Step 4: The third iteration begins by running Goal Seek again and obtaining a value for β of 0.8787. Running Solver gives new αs of 8364.63, 1702.07, and 258.30. This is the last iteration needed, since running Goal Seek with these αs does not change the value obtained for β. The upper limit of a 95% confidence interval for β is 0.879. Step 5: In this example, Goal Seek moved β from the MLE value to the upper confidence limit in three iterations. We now need to find the lower confidence limit for β. This will be a value smaller than the MLE of .723, about as far away from .723 as the upper bound is in the other direction. So, we change B2 to .723 −.156 = .557 as a starting value and repeat Steps 1 through 4 above. Again, only three iterations will be needed and the value that β converges to is .5830. A 95% confidence interval for β is therefore (.583, .879). Note that the spreadsheet interval is identical to the JMP interval and required only three iterations for convergence. There is no guarantee, however, that either the JMP routine or the spreadsheet iterations will converge in every case—this will depend on the likelihood surface and the choice of starting estimates for the unknown parameters. If the spreadsheet calculations produce estimates that appear to be wandering aimlessly, try guessing better starting values and/or putting more restrictions on the parameters when using Solver. It is also possible you will get slightly different numbers than those shown here when you try this exercise, depending on your starting values.


FIGURE 8.6A Solver example for confidence limit calculations.

But, the final confidence intervals should be approximately the same. If Solver seems to converge but one or more of the G-column likelihood values are listed as “#NUM,” it means that the solution values are meaningless. Then, you either need better starting values or you need to input boundary conditions so that solver does not wander off into regions where the likelihood is undefined. If you use JMP (release 6 or earlier), the Survival Parametric Platform only provides 95% confidence intervals and you cannot input different starting values if the routine does not converge. The JMP Nonlinear Fit platform does let you choose the confidence level and starting parameter values, but you need to program the appropriate negative log LIK formulas for your data and model. Minitab 14, on the other hand, does not use the profile likelihood method. Minitab calculates confidence intervals using a generally less accurate approximate normality approach. EXERCISE 8.1A Use the spreadsheet method to calculate 90% confidence intervals for all the α parameters for the data in Table 8.2 (assuming a common β).
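The Goal Seek/Solver iteration just described is a profile likelihood search: fix the shape parameter, re-minimize over the scale parameters, and move the shape until the minimized negative log likelihood rises by 1.92073 above its minimum. The sketch below shows the same idea in Python for a single cell of made-up readout data (the data values, bracket limits, and search grid are placeholders we chose, not the Table 8.2 inputs); for the common-β problem of this example, the profile step would simply minimize over all three α values instead of one.

```python
# Sketch: profile likelihood confidence bounds for a Weibull shape parameter,
# mirroring the Goal Seek/Solver iteration.  Readout data are made-up placeholders.
import numpy as np
from scipy.optimize import minimize_scalar, brentq
from scipy.stats import chi2

readouts  = np.array([100.0, 200.0, 500.0, 1000.0])
fails     = np.array([2, 5, 9, 12])
survivors = 72

def negloglik(alpha, beta):
    F = 1.0 - np.exp(-(readouts / alpha) ** beta)
    probs = np.diff(np.concatenate(([0.0], F)))
    return -(np.sum(fails * np.log(probs)) + survivors * np.log(1.0 - F[-1]))

def profile(beta):
    """Minimize over alpha with beta held fixed (the step Solver performs)."""
    res = minimize_scalar(lambda la: negloglik(np.exp(la), beta),
                          bracket=(np.log(100.0), np.log(5000.0)))
    return res.fun

# Locate the joint MLE of beta with a simple grid over the profile
betas = np.linspace(0.2, 3.0, 281)          # adjust the grid for other data sets
prof  = np.array([profile(b) for b in betas])
beta_hat, L_min = betas[np.argmin(prof)], prof.min()

# 95% bounds: where the profiled negative log likelihood rises by 1.92073
target = L_min + 0.5 * chi2.ppf(0.95, df=1)
upper = brentq(lambda b: profile(b) - target, beta_hat, betas[-1])
lower = brentq(lambda b: profile(b) - target, betas[0], beta_hat)
print("approximate 95% profile interval for beta:", (lower, upper))
```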

8.4A Using a Spreadsheet for Lognormal Analysis of the High-Stress Failure Data Shown in Table 8.5 Figure 8.7A shows the first two data sets and part of the third set in a spreadsheet for this analysis (Ex8.5.xls). The spreadsheet entries and the pull-down construction of the “G” formula column are similar to those in the solution of Example 8.2. This worksheet was constructed from the Genlnest.xls template by inputting the data and following the ­column heading instructions.


FIGURE 8.7A Excel spreadsheet for calculating MLEs of individual cells.

The equal-sigma analysis is done the same way, except that the fixed references to the σ parameter stay $B$2 for every row. The results are shown in Figure 8.8A. The worksheet (Ex8.5SameSigma.xls) has the formulas. Note: The template Genlncellsest.xls offers an easier way to estimate the parameters for up to six cells of lognormal data, assuming equal σs. 8.5A Using a Spreadsheet for MLE Confidence Bounds for the Lognormal Shape Parameter As in Example 8.5, we can also use a spreadsheet to calculate the 95% confidence intervals for the parameters. We show the calculation for σ. Figure 8.9A shows the spreadsheet as it would look just after calculating the MLEs and opening Goal Seek to search for an extreme 2 σ value in the 95% likelihood region. The target is L1 + 0.5 χ1;.95 = 412.74856 + 1.92073 = 414.66929. This value is put in cell J2 for Goal Seek to achieve. Figure 8.10A shows the Solver screen that is used for every new Goal Seek calculation. For this example, it took three iterations for the σ upper limit to converge to 0.6692. With the σ value in B2 changed to 0.53, it takes another three iterations for the lower bound to converge to 0.5466. The resulting 95% confidence interval of (.547, .669) is the same as the JMP result. Note, however, that Minitab calculates approximate normal limits and gives a slightly different interval (0.545, 0.667).


FIGURE 8.8A Excel spreadsheet for calculating MLEs of same-sigma cells.

FIGURE 8.9A Spreadsheet confidence bound calculation for common sigma.


8.6A  Using a Spreadsheet for Arrhenius–Weibull Model We start with the spreadsheet template GenWeibtempest.xls. After inputting the data from Table 8.2 and following column heading instructions, we have the Excel spreadsheet shown in Figure 8.11A. Figure 8.11A also shows the Solver inputs that lead to the converged values on the spreadsheet.

FIGURE 8.10A Solver screen for confidence bound calculation.

FIGURE 8.11A Excel spreadsheet fit for Arrhenius–Weibull model.


EXERCISE 8.2A Set up the spreadsheet in Figure 8.11A and use Goal Seek and Solver and the iterative spreadsheet confidence interval technique shown in Examples 8.3A and 8.5A to derive the 95% confidence intervals for β and the Arrhenius equation constants. Note: Your intervals should be nearly the same as those in Figure 8.25.

8.7A Using a Spreadsheet for MLEs for Arrhenius–Power Relationship Lognormal Model

Accelerated life test MLEs can be easily calculated using a spreadsheet and Solver. Figure 8.12A shows part of the spreadsheet and the Solver inputs for an Arrhenius–power relationship analysis of the data in Table 8.5. The spreadsheet template for a lognormal Arrhenius/voltage power relationship analysis can be downloaded (Genlntempvolt.xls). After downloading the template, remove the protection status. Input the six sets of data and the temperatures and voltages and put starting-value guesses in A2, B2, C2, and D2. Then, run Solver to get the output in Figure 8.12A. If you want to fit a Weibull distribution to the same or similar temperature/voltage data, download GenWeibtempvolt.xls. As a confirmation check, note that the converged values for the parameters in cells A2, B2, C2, and D2 are almost the same as the solutions obtained by Minitab and JMP. It is also possible to obtain a 95% confidence interval for Fuse(40,000) using Solver, Goal Seek, and a useful trick. You have to reparameterize the model parameters so that Fuse(40,000) becomes a new "parameter," replacing the "a = ln A" parameter in the model.

FIGURE 8.12A Spreadsheet calculation of Arrhenius–power relationship model parameter estimates.


Start with the identity

p = Fuse(40,000) = Φ([ln(40,000) − ln T50]/σ) = Φ({ln(40,000) − [a + ΔH × 11,605/(T + 273.15) + B ln V]}/σ)

where Φ is the standard normal CDF. Next, apply the inverse function Φ−1 to obtain

Φ−1{Fuse(40,000)} = {ln(40,000) − [a + ΔH × 11,605/(T + 273.15) + B ln V]}/σ

Solving for a yields

a = ln(40,000) − σ × Φ−1{Fuse(40,000)} − ΔH × 11,605/(T + 273.15) − B ln V

Wherever a appears in the spreadsheet likelihood formula, it will be replaced by the last equation with a use temperature of 25 for T and a use voltage of 4 for V. Also, add a parameter column N2 labeled Fuse(40,000) with a guess for a starting value, say 0.01. A2 becomes a calculated column, using the formula

= LN(40000) − NORMINV($N$2, 0, 1) * $D$2 − (($B$2 * 11605)/298.15) − $C$2 * LN(4)

The new spreadsheet (Figure 8.13A.xlsx), along with the Solver inputs for finding MLE solutions for the new parameters, is shown in Figure 8.32. After running Solver, the projected use CDF at 40,000 hours is 0.01125, the same as that given by JMP and Minitab.
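The reparameterization is easy to mistype in a spreadsheet, so a small function that restates the algebra can be a useful check. The sketch below is our own convenience code, not part of the original procedure; the parameter values in the example call are hypothetical rather than the fitted Table 8.5 results, and the function simply evaluates the last equation at the use conditions T = 25°C and V = 4 V.

```python
# Sketch: recover the intercept a = ln A from a trial value of Fuse(40,000),
# mirroring the reparameterized spreadsheet column A2.
from math import log
from scipy.stats import norm

def intercept_from_use_cdf(f_use, dH, B, sigma, use_temp_c=25.0, use_volts=4.0,
                           hours=40000.0):
    """a = ln(hours) - sigma*Phi^-1(f_use) - dH*11605/(T + 273.15) - B*ln(V)."""
    return (log(hours)
            - sigma * norm.ppf(f_use)
            - dH * 11605.0 / (use_temp_c + 273.15)
            - B * log(use_volts))

# Example call with hypothetical parameter values (not the fitted Table 8.5 results):
print(intercept_from_use_cdf(f_use=0.01, dH=0.8, B=-2.0, sigma=0.6))
```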

FIGURE 8.13A Spreadsheet for calculating use CDFs and confidence bounds.


FIGURE 8.14A Goal Seek and Solver inputs for calculating profile likelihood limits.

A 95% confidence interval for Fuse(40,000) is obtained by running Goal Seek. Vary N2 to obtain the negative LIK sum of 414.36697 + 1.92074 = 416.2877 for L2. Then, run Solver with just the parameters in B2, C2, and D2 to get a new L2. Iterate several times and N2 will converge to 0.243. Repeat, choosing a new starting value of L2 below 0.01125, say 0.0001. Eventually, the value in L2 should converge to 0.00005. The profile likelihood interval for Fuse(40,000) is therefore (0.00005, 0.246). The Goal Seek and Solver inputs for each iteration are shown in Figure 8.14A. We use the LR test to check the model adequacy with L2 = 410.939 (see Example 8.5 and Table 8.5) and L1 = 414.367 (labeled “Full” by JMP in the output shown in Figure 8.27). The chi-square test statistic is 2 × (414.367 − 410.939) = 6.856, with 12 − 4 = 8 dfs. A number this large will occur 55% of the time when the model applies; thus, there is no reason to reject the model fit. EXERCISE 8.3A Set up the spreadsheet in Figure 8.12A (Figure 8.12.xls). Use Goal Seek and Solver and the iterative spreadsheet confidence interval technique in Example 8.5 (Figures  8.8A and 8.9A) to derive the 95% confidence intervals for σ and the Arrhenius/Voltage Power relationship constants. Note: Your intervals should be nearly the same as those in Figure 8.8A.
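The 55% probability quoted above for the LR model-adequacy check is easy to confirm with any statistical package. For example, a two-line Python check using the L1 and L2 values from the text:

```python
# Sketch: p-value for the LR model-adequacy test quoted above.
from scipy.stats import chi2

lr = 2 * (414.367 - 410.939)          # = 6.856 on 12 - 4 = 8 degrees of freedom
print(round(chi2.sf(lr, df=8), 2))    # roughly 0.55, so no reason to reject the model
```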

8.8A  Spreadsheet Templates for Weibull or Lognormal MLE Analysis

Table 8.1A lists the templates that can be downloaded from the publisher's website at http://www.crcpress.com/product/isbn/9781584884668 and shows what kinds of analyses they are set up for. The templates have column headings that describe how to use them.

NOTE: When first downloaded, these templates may be in a locked or protected state. This means that data cannot be analyzed until the spreadsheet is unprotected. Use help for your version of Excel to find out how to unprotect a worksheet.


TABLE 8.1A
Spreadsheet Templates

Template                  Use
GenWeibest.xls            One sample of Weibull Data MLE calculations
GenWeibcellsest.xls       Several cells (one shape parameter) Weibull MLEs
GenWeibtempest.xls        Weibull–Arrhenius
GenWeibtempvoltest.xls    Weibull–Arrhenius + Power Relation for Voltage
Genlnest.xls              One sample of lognormal Data MLE calculations
Genlncellsest.xls         Several cells (one shape parameter) lognormal MLEs
Genlntempest.xls          lognormal Arrhenius
Genlntempvoltest.xls      lognormal Arrhenius + Power Relation for Voltage

Problems

8.1 A population of components has a CDF model given by F(t) = 1 − k/(k + t). Assume there is linear acceleration between the testing stress and the use stress, with acceleration factor a. Write the CDF for the life distribution at use conditions.
8.2 Make up a table of acceleration factors for every 5° increase in temperature from a starting temperature of 25°C up to 125°C and for ΔH values from 0.5 to 1.5 in steps of 0.1. At what value of ΔH does the acceleration roughly double for every 10°C increase in temperature? At what value of ΔH does the acceleration roughly double for every 5°C increase?
8.3 A component, tested at 150°C in a laboratory, has an exponential distribution with an MTTF of 3000 hours. Typical use temperature is 45°C. Using an acceleration factor of 150 between these two temperatures, what is the use failure rate and what percentage of component failure is expected during the first 40,000 hours? What is ΔH?
8.4 A component, based on laboratory test results at 150°C, is modeled using a Weibull distribution with characteristic life of α = 1000 hours and a shape parameter of β = .5. Typical use temperature is 55°C. Assuming an acceleration factor of 250, what is the expected average failure rate over the first 40,000 hours? What fraction of the components should typically survive 40,000 hours? Calculate ΔH for these components.
8.5 A component, running at 125°C in the laboratory, is modeled using a lognormal life distribution with a T50 of 15 hours and a σ value of 1.0. Using an Arrhenius model with an activation energy of 1.3 eV, find the acceleration factor for a use condition of 45°C. What fraction of the components will be expected to survive up to 20,000 hours?
8.6 Using the electromigration model described in Section 8.11, calculate an acceleration factor between a stress test run at 150°C and a current density of 500,000 amps/sq cm and a field condition of 35°C and 100,000 amps/sq cm. Use an activation energy of 0.5 eV and a current density exponent of two. Using a lognormal failure distribution with a σ of 2.5 and a T50 at stress of 2500 hours, estimate the fraction of units expected to fail by 20,000 hours at field conditions.


8.7 Show how, by a proper choice of stress functions and parameter values, the general Eyring equation for two stresses can be converted into the electromigration model.
8.8 Four cells of components were put on test at temperatures of 110°C, 120°C, 130°C, and 140°C to model a failure mechanism with an Arrhenius acceleration model. The lowest temperature cell had 100 units on test; all other cells contained 50 units. Readouts were taken at 24, 48, 96, 150, 500, and 1000 hours. The test ended at the last readout, and the data are summarized below.

Readout Hours   110°C Cell New Failures   120°C Cell New Failures   130°C Cell New Failures   140°C Cell New Failures
24              0                         0                         1                         2
48              0                         0                         1                         1
96              0                         2                         3                         3
150             0                         0                         1                         1
500             3                         3                         3                         10
1000            1                         3                         8                         10
Unfailed        96                        42                        33                        23

Use a Weibull life distribution and estimate β, ΔH, and the use CDF at 40,000 hours at a use temperature of 60°C. Repeat the analysis using a lognormal distribution. Which distribution appears to fit the data better? Use MLE fits and look at probability plots. For both the lognormal and the Weibull fit, give 95% profile likelihood confidence intervals for the shape parameter, ΔH, and the use CDF at 40,000 hours.
8.9 To model a failure mechanism that is accelerated by both temperature and voltage, six cells, each with 50 components, were run at different temperature and voltage combinations. Readouts were taken at 24, 48, 168, 500, 1000, and 2000 hours, and the numbers of failures were recorded. The test ended at 2000 hours. The data are summarized below.

Readout Hours   8 V, 125°C Failures   10 V, 115°C Failures   10 V, 125°C Failures   12 V, 95°C Failures   12 V, 115°C Failures   12 V, 125°C Failures
24              2                     0                      1                      0                     0                      0
48              0                     0                      0                      0                     0                      0
96              4                     0                      2                      0                     3                      5
500             6                     3                      10                     0                     6                      15
1000            9                     9                      16                     2                     9                      20
2000            9                     10                     13                     3                     18                     8
Unfailed        20                    28                     8                      45                    14                     2

Use the Arrhenius–inverse power model (AI model) in Example 8.9 and a lognormal life distribution. Estimate ΔH and B and the use CDF at 100,000 hours for use conditions of 2 V and 25°C. Repeat the analysis using a Weibull life distribution. Which distribution appears to fit the data better? Give 95% profile likelihood confidence intervals for these parameters for both the lognormal and the Weibull fits.


8.10 Write the general expression for the field cumulative hazard function of a Weibull population after a burn-in equivalent to tbi field hours. Show that a plot of the cumulative hazard function on log–log paper has a slope that approaches one as the field time approaches zero. Show also that the slope approaches the original Weibull shape parameter β, as time approaches infinity. 8.11 Components made by a certain manufacturer have a failure mechanism modeled with a lognormal distribution. Typical use condition is 35°C. An accelerated test was run with these components using three temperature cells: 175°C, 150°C, and 125°C. Plots of the data at each temperature supported a lognormal distribution with a common σ of 3.9. The T50 estimates for the three cells were, respectively, 75, 150, and 500 hours. The corresponding values of (1/kT) are 25.89, 27.42, and 29.15. Plot lnT50 versus 1/kT. Determine the slope and intercept to get the activation energy and proportionality constant. Estimate the field usage T50 and the fraction failing by 30,000 hours. 8.12 A population of components has a failure mechanism modeled with an exponential distribution. Typical use temperature is 45°C. The failure mechanism is accelerated by increasing the radiation level. A sample of 100 units was tested under a radiation dosage 10 times the normal use dosage. A model for the mean time to failure is

MTTF = AD^−α

where A is a proportionality constant, D is the radiation dosage, and α is an empirically estimated constant equal to 1.5. If the high-stress failure rate was found to be 1.75%/K, what is the MTTF estimate under normal use conditions?


9 Alternative Reliability Models This chapter discusses several models that, combined with the acceleration modeling methods of Chapter 8, can be very useful for designing reliability life test experiments and analyzing reliability test or field data. Sections 9.1 and 9.2 deal with step stress experiments and degradation data analysis. These approaches extend the methods of Chapter 8 to situations where the analyst is uncertain of obtaining sufficient failure data at the stress levels chosen for experimentation or where time and equipment exclude the multistress cell experiments described in Chapter 8. Section 9.3 extends the maximum likelihood estimate (MLE) modeling techniques used in Chapter 8 to situations where the analyst needs to determine which of one or more categorical factors are influencing the T50 or α scale parameters. For example, is vintage a significant factor and how by much does it change lifetime? Are there significant differences in lifetime based on the factory of manufacture or the vendor from which a component was procured? Actions based on answers to questions like these can have significant financial relevance, and as the software examples will show, lifetime regression provides a powerful tool for obtaining these answers. Section 9.4 looks at a model that is very popular in biomedical survival analysis modeling: the proportional hazards model. Section 9.5 explores data analysis when the failures seem to stop occurring long before all the units on test (or observed in the field) have failed. Are there subpopulations which are defective in some way, causing some units to fail relatively early, while the rest of the (nondefective) units continue to operate until they reach their far-off wear-out times? These situations can be modeled using defect subpopulations. The accelerated life models of Chapter 8 can still be applied to the defect subpopulation failure data. Based on the authors’ extensive industrial experience, defect models are often appropriate and may be the most neglected and underused analysis approach in the literature.

9.1  Step Stress Experiments

The step stress technique for life testing ensures enough failures occur by periodically increasing the operating stress within a cell until a large number of the units have failed. The primary drawback to changing stresses by steps, while an experiment is running, is the difficulty of analyzing the resulting data and constructing appropriate models. This section will describe a way of carrying out accelerated testing model analysis using only one cell of units and periodically increasing that cell's operating temperature (and other stresses, if applicable). The method relies on repeated application of the concept that changing stress is equivalent to a linear change in the time scale. If we knew the value of ΔH, we could calculate acceleration factors that would convert time intervals at various temperatures to equivalent time intervals at one fixed reference temperature. The step stress experiment would then be reduced to a single stress experiment, with artificial, calculated readout time


FIGURE 9.1 Arrhenius step stress data schematic (a time line of readout times R1, R2, …, Rk, with temperature Ti and new failures ri in each interval; Rk = end of test).

points replacing the actual readout times (of course, as in any type of accelerated testing, the usual caution applies: you have to be careful that the increased operating stress levels do not introduce new failure mechanisms). To show how we reduce step stress data to equivalent single stress data assume the actual readout times are R1, R2, … , Rk and the test ends at Rk. At each readout time, new failures are recorded and the experimenter has the option of increasing the operating temperature of the cell. Let the temperature during the interval 0 to R1 be T1. The temperature between R1 and R2 is T2, and so on, with the final (highest) temperature Tk occurring during the kth interval between times Rk−1 and Rk. At readout time Ri, the number of new failures observed is ri. N units start test and the survivors at the end of test are S = N − Σri. The “real time” events of this experiment are diagrammed in Figure 9.1. Now assume we know ΔH. Then we can normalize all the time intervals to equivalent times for operation at the initial (first) temperature T1. First, we calculate all the acceleration factors between later temperatures Ti and the starting temperature T1. These are given by

AFi = exp[(ΔH/k)((1/T1) − (1/Ti))],  i = 2, 3, …, k

The transformed length of the ith interval becomes AFi (Ri − Ri−1) and the equivalent single stress readout times (all at temperature T1) are

R′1 = R1,  R′i = R′i−1 + AFi(Ri − Ri−1),  i = 2, 3, …, k

Using these transformed readout times, and the numbers of observed failures, lognormal or Weibull or exponential estimation can be carried out as shown in Chapters 3, 4, and 5. Of course, the problem with all this is we do not know ΔH in advance. If we did, we would not have to run a step stress experiment. However, if we were to assume different values for ΔH, and then fit life distribution parameters to the resulting transformed data, the “best” ΔH would be the one where the negative log likelihood for the transformed readout data is minimized. This ΔH estimate is the MLE and the corresponding distribution parameter estimates are also the MLEs for the life distribution at the stress conditions in effect during the first readout interval. Spreadsheets are powerful tools for carrying out this sort of analysis—they make it easy to calculate the transformed readout times and the negative log likelihood for a given set of starting parameter guesses, and then Solver can be used to compute the desired MLEs. JMP (release 6 or later) can also find MLEs for step stress data using the nonlinear model fitting platform. EXAMPLE 9.1  AN ARRHENIUS STEP STRESS EXPERIMENT A sample of 250 components from a population with a temperature dependent lognormal life distribution was tested to determine a T50 and σ at one temperature and an estimate of the Arrhenius ΔH parameter. Since only one test chamber was available, a step stress method was chosen.


The  initial temperature was 85°C. Readouts were made at 500, 1000, 1500, and 2000 hours. There were no failures at 500 hours, one failure at 1000 hours, seven failures at 1500 hours, and five failures at 2000 hours. At 2000 hours, the temperature was increased to 100°. There were two failures at the readout at 2100 hours. Then the temperature was increased to 115°. The next readout, at 2200 hours, showed four failures. At this point a final temperature increase to 125° was made. The last three readouts, at 2300, 2400, and 2500 hours, showed 36, 38, and 29 failures, respectively. At 2500 hours, the test ended, with 128 unfailed units. What are the lognormal parameters (T50 and σ) and ΔH estimates?

Solution Table 9.1 and Figure 9.2 show how ΔH could be estimated by an iterative graphical procedure. Table 9.1 gives the actual readout times and cumulative percent failures, as well as equivalent readout times (referenced to the 85° temperature) assuming values of 0.5, 0.86, and 1.0 eV for ΔH.

TABLE 9.1
Arrhenius Step Stress Example

                              Equivalent Readout Times
Stress (°C)   Readout Times   ΔH = 0.5   ΔH = 0.86   ΔH = 1.0   Cumulative % Failures
85            500             500        500         500        0.0
85            1000            1000       1000        1000       0.4
85            1500            1500       1500        1500       3.2
85            2000            2000       2000        2000       5.2
100           2100            2192       2307        2368       6.0
115           2200            2542       3168        3591       7.6
125           2300            3051       4812        6184       22.0
125           2400            3560       6455        8776       37.2
125           2500            4069       8099        11,368     48.8

FIGURE 9.2 Plot of step stress data for ΔH = 0.5, 0.86, and 1.0 (lognormal probability plot of the cumulative readout data versus equivalent readout time, with MLE lines drawn for each value of ΔH).


For example, at 100°C the acceleration factors are 1.92, 2.07, and 3.68, respectively. So, 100 hours at 100°C is equivalent to 192, 307, or 368 hours at 85°C, depending on the value of ΔH. Figure 9.2 shows a plot on lognormal probability paper of the cumulative percent failures versus the equivalent readout times for each choice of ΔH. The MLE of ΔH = 0.86 (which we will derive next) gives the “best” line, with the high and low ΔH value lines fanning out to the right and the left of the “correct” line. Fitting successive least squares graphical lines to a range of ΔH values and picking the ΔH which gives the minimum residual sum of squares line would be the graphical estimation procedure. The graph in Figure 9.2 was obtained from Minitab. Now we show how the best estimates for the lognormal distribution parameters and the activation energy can be obtained simultaneously using MLEs via spreadsheet functions. Our first step is to set up the spreadsheet to calculate the equivalent lengths of the readout intervals under Arrhenius acceleration. Initially, we shall assume ΔH = 0.75 eV (we may later have to choose a different starting value if the analysis fails to converge to an MLE for ΔH). The spreadsheet is shown in Figure 9.3. Column H shows the actual length of each readout interval. Column I uses the Arrhenius formula, referencing the starting value of ΔH = 0.75 eV (in C2), to calculate transformed equivalent interval times if the temperature had remained at 85°C. The Arrhenius acceleration formula in I2 that is pulled down for all the readout intervals is



= H2 ⋅ EXP(($C$2 ⋅ 11605 ⋅ (F2 − $F$2))/((F2 + 273.15) ⋅ ($F$2 + 273.15))) Column J is the cumulative sum of column I that provides the equivalent end of readout intervals referenced to 85°C. The next step is to set up a column for the log-likelihood functions. For the first readout, we have



r1 ln[F (R1)] where r1 is the number of failures in the first interval of length R1 and F(t) is the cumulative distribution function (CDF) for the lognormal distribution with mean μ and shape parameter σ. For readouts, at Ri , i = 2, … , k, for k = 8, the log-likelihood functions are



ri ln[F (Ri ) − F (Ri −1)] To account for censoring at the final readout, the log likelihood is



S ln[1 − F(Rk)]

where S is the number of unfailed (censored) units remaining at the end of the test.

FIGURE 9.3 Spreadsheet for calculating step stress interval widths under Arrhenius acceleration.


FIGURE 9.4 Spreadsheet for calculating step stress Arrhenius lognormal MLEs.

We need to express these relations as spreadsheet formulas. The function for the lognormal distribution in Excel is LOGNORMDIST (time, mu, sigma). As we see in Figure 9.4, ­column K will contain the log-likelihood expressions. For example, in cell K3, the formula is =  G3 ⋅ LNI(LOGNORMDIST($J3,$A$2,$B$2) − LOGNORMDIST($J2,$A$2,$B$2)). Cells A2, B2, and C2 contain starting guesses for mu, sigma, and ΔH, respectively. Cell L2 is the negative sum of the log likelihoods in column K. Beginning with trial values of mu = 5, sigma = 1, and ΔH =.75, we use Excel solver to estimate the MLEs for mu, sigma, and ΔH. The results are displayed in Figure 9.4, which also shows the solver screen. The converged values of mu, sigma, and ΔH are 9.05, .80, and .86, respectively. In cell M2, T50 is evaluated using EXP(A2) to give 8507. The interested reader can download Ex9.1lnorm.xls from the “Downloads & Updates” tab on the web page for this book at http://www.crcpress.com/product/isbn/9781584884668. Since the data were simulated, we know the “true” values to compare to these estimates. These were: T50 = 8000, σ = .9 and ΔH = .85. Our MLEs, especially for ΔH and sigma, were very close to the true values. EXERCISE 9.1 Use the LR profile confidence interval method described in Chapter 8 (see Section 8.5) to derive the 95% confidence interval (.66, 1.12) for ΔH in Example 9.1. EXERCISE 9.2 Analyze the step stress data in Table 9.1 using a Weibull model assumption for the life distribution. (Hint: The formulas for the negative log likelihoods in column J of Figure 9.4 need to be changed to contain Weibull CDFs and the mu and sigma parameters replaced by the alpha and beta Weibull parameters in columns A and B—download Ex9.1Weib.xls to check your result.)
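For readers who prefer a script to a spreadsheet, here is a minimal Python sketch of the same calculation, using the Example 9.1 readout data directly. The optimizer, starting values, and output formatting are our own choices (the starting guesses mirror the text); the converged estimates should be close to the spreadsheet values of mu = 9.05, sigma = 0.80, and ΔH = 0.86, and better starting values can be tried if the search stalls.

```python
# Sketch: step stress Arrhenius-lognormal MLEs for the Example 9.1 data.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

readouts  = np.array([500, 1000, 1500, 2000, 2100, 2200, 2300, 2400, 2500.0])
temps_c   = np.array([85, 85, 85, 85, 100, 115, 125, 125, 125.0])  # temp during each interval
fails     = np.array([0, 1, 7, 5, 2, 4, 36, 38, 29])
survivors = 128
T1 = temps_c[0]

def equivalent_times(dH):
    """Transform each readout interval to its equivalent length at 85C."""
    widths = np.diff(np.concatenate(([0.0], readouts)))
    af = np.exp(dH * 11605.0 * (temps_c - T1) /
                ((temps_c + 273.15) * (T1 + 273.15)))
    return np.cumsum(widths * af)

def negloglik(params):
    mu, log_sigma, dH = params
    sigma = np.exp(log_sigma)
    R = equivalent_times(dH)
    F = norm.cdf((np.log(R) - mu) / sigma)          # lognormal CDF at the readouts
    probs = np.diff(np.concatenate(([0.0], F)))
    keep = probs > 0                                 # guard against underflow
    return -(np.sum(fails[keep] * np.log(probs[keep]))
             + survivors * np.log(1.0 - F[-1]))

res = minimize(negloglik, x0=[5.0, 0.0, 0.75],      # mu, ln(sigma), deltaH starting guesses
               method="Nelder-Mead", options={"maxiter": 5000})
mu_hat, sigma_hat, dH_hat = res.x[0], np.exp(res.x[1]), res.x[2]
print("T50 =", np.exp(mu_hat), "sigma =", sigma_hat, "deltaH =", dH_hat)
```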


One comment on the design of the experiment described in the last example: the most accurate results are obtained by staying at a low-level stress as long as needed to get some data, then going quickly to the highest possible stress—as was done in this case. Step stress experiments involving simultaneous changes to more than one stress can also be carried out. The approach is no different from the approach already discussed: Assume the acceleration model parameters are known and compute acceleration factors for every readout interval and use them to convert the duration of that interval to an equivalent duration for units operating at the lowest (initial) set of stresses. Then use the MLE method to find estimates for the acceleration parameters and the distribution parameters at that lowest stress condition. Example 9.2 will carry out this analysis for a step stress Arrhenius, power law acceleration model under a Weibull life distribution assumption.

EXAMPLE 9.2  AN ARRHENIUS, POWER LAW STEP STRESS EXPERIMENT

A component normally operates at 25°C and 2 V. Previous life test experiments during the product development stage had shown the Weibull life distribution and the Arrhenius (inverse) power law

acceleration model α = Ae kT V B were good models for the wear-out failure mode. The historical values of ΔH and B were .55 and −1.3, respectively, with a Weibull β of 1.9. The acceleration factor for operation at 125°C and 16 V based on this model is 3230 and this value was used to construct an end of the line life test at 125°C, 16 V. This test was used to check a periodic sample of new units. Over the course of time, after several cost-effective process modifications, it was decided that it would be prudent to check that the model, developed for the first generation of components, was still valid. Since resources were limited, it was decided to run one cell of 225 components, increasing the stress levels as the test progressed. The initial stresses were set at 85°C and 8 V. These stress levels were kept for two readouts at 500 and 750 hours, with three and four fails recorded, respectively. Then, the temperature level and voltage were increased to 105°C and 12 V for the next 250 hours. There were 32 fails recorded at the 1000-hour readout. The voltage was increased to 16V for the next 250 hours, and there were 50 failures recorded at the 1250-hour readout. The last three readout intervals, ending at 1500, 1900, and 2000 hours, had stress levels of 125°C and 10, 12, and 16 V, respectively. The corresponding failures recorded were 62, 64, and 10, with no unfailed units left at the 2000 hours end of test. Assuming a Weibull distribution and the Arrhenius power law acceleration model, estimate ΔH, B and beta.

Solution A spreadsheet for the data analysis (Ex9.2.xls) is given in Figure 9.5. The formula for calculating transformed interval durations (in J2 and pulled down) is



= I2 ⋅ (EXP(($C$2 ⋅ 11605 ⋅ (F 2 − $F$2))/((F 2 + 273.15) ⋅ ($F$2 + 273.15)))) ⋅ ($G$2/G2) ^ $D$2 The function for the Weibull distribution in Excel is WEIBULL (time, beta, alpha, true). As we see in Figure 9.5, column L will contain the log-likelihood expressions. For example, in cell L3, the formula is



= H3 ⋅ LN(WEIBULL(K3,$B$2,$A$2, TRUE) − WEIBULL(K2,$B$2,$A$2, TRUE)) Solver, also shown in Figure 9.5, calculated the MLEs shown in A2, B2, C2, and D2. Since the β, ΔH and B estimates closely matched historical values, no change was made to the end of line reliability test specifications.
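The transformed-interval formula above (the J column) is simply a two-stress acceleration factor, an Arrhenius term in temperature multiplied by a power term in voltage, applied to each readout interval. A direct Python translation is shown below as a convenience sketch; the reference conditions of 85°C and 8 V and the historical values ΔH = 0.55 and B = −1.3 used in the example call come from this example, while the function name and argument layout are our own.

```python
# Sketch: two-stress (temperature + voltage) acceleration factor used to
# transform each step stress interval back to the initial 85C / 8V condition.
import math

def accel_factor(temp_c, volts, dH, B, ref_temp_c=85.0, ref_volts=8.0):
    arrhenius = math.exp(dH * 11605.0 * (temp_c - ref_temp_c) /
                         ((temp_c + 273.15) * (ref_temp_c + 273.15)))
    return arrhenius * (ref_volts / volts) ** B

# With the historical values dH = 0.55 and B = -1.3, a 105C / 12V interval
# is stretched by roughly this factor before the Weibull likelihood is formed:
print(accel_factor(105.0, 12.0, dH=0.55, B=-1.3))
```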


FIGURE 9.5 Spreadsheet for calculating step stress Arrhenius/power law Weibull MLEs.

EXERCISE 9.3 Show that the 90% LR confidence interval for ΔH is the wide interval (0.07, 1.19).

Note the wide confidence interval for ΔH given in Exercise 9.3. The downside of trying to fit a two-stress model using only one cell and the step stress technique is the lack of precision in the results. The estimates have a large variance and are highly correlated with one another. Fitting the wrong distributional model, such as using a lognormal for the data of Example 9.2, might give very misleading model estimates. That is the price you often pay when trying to fit a complex two-stress model with a single cell of step stress data. On the other hand, the gains in sample size, time, and resources are often ample compensation for the increased risks. Before leaving the topic of step stress data analysis, we mention that it is possible to use JMP (release 6 or later) to find MLEs for the data given in Example 9.1. However, the procedure requires knowledge of JMP scripting to set up the formulas. For the interested reader, we show in Appendix 9.1A how to use JMP on the step stress data of Table 9.1.

9.2  Degradation Models In Chapter 8, and with step stress experiments, many test failures at different stress levels were needed to estimate life distribution and model parameters. In certain cases, however, it is possible to do an analysis without actual failures. This occurs when there is a measurable product parameter that is degrading over time towards a level that is defined to be a “failure” level.


For example, a component may start test with an acceptable resistance value reading. Over time the resistance reading “drifts.” Eventually it reaches a certain unacceptable value, or undergoes an unacceptable percent change, and the part is considered to have failed. At every test readout, the resistance reading can be measured and recorded, even though failure has not yet occurred. If we call each measurement a data point, a test cell of n components yields n data points at every readout time, even though few or none of the components may fail during the test. We need one key assumption to make use of this data. There has to be some function of the measurable parameter that is changing linearly with time. In some cases, it is more convenient to have the change be linear with log time or time to an exponent, but we will restrict ourselves to the simple case in this treatment. For example, let Qt be a threshold voltage level which degrades over time according to the formula Qt = Q0e−Rt. Then, the function G(Qt) = ln Qt = ln Q0 − Rt changes linearly with time. If we call Qt the measurable product parameter at time t, then the linear change assumption implies

G(Qt ) = G(Q0 ) + R(S)t or G(Qt ) = I + R(S)t

where R(S) is the degradation slope at stress S and I is the initial value of G at time zero (or G(Q0)). Of course there will also be error introduced in every observation due to experimental error and component to component differences. These types of errors will be discussed as we look at ways of estimating R(S) from data and using these estimates to fit acceleration models. We will look at two ways we can use degradation data to fit acceleration models. 9.2.1  Method 1 This method is similar to other previously described graphical techniques, in that it is informal but easy to understand and carry out without sophisticated statistical tools. On a sheet of graph paper, let the y-axis represent G(Qt) and the x-axis represent time. Draw a horizontal line across the paper at the G(Qt) value that is defined to be a failure. Next plot the data points corresponding to the G(Qt) readings for a test unit at each readout. If the right G(Qt) has been chosen these points should line up approximately on a straight line. They will not be exactly on a line because of experimental measurement errors, which we assume are independent and identically distributed. Fit a line through the points, either by eye or by means of a regression program. Then extend the line until it crosses the failure line and note the corresponding time. This value is the derived time of fail for the first unit (see Figure 9.6). Repeat this procedure for every unit in every stress cell. The lines will have different slopes because of component to component variation in degradation slope—this variation is in addition to measurement error and is the reason component lifetimes vary in a population. When this part of the analysis is complete, the derived failure times make up a data set similar in appearance to a life test experiment where every unit was tested until failure. These derived failure times can be used to estimate life distribution parameters. Then we can fit an acceleration model, using the methods we have already discussed in Chapter 8. One problem with this procedure is that the many readout measurement errors introduce additional variability into the data, which might inflate the measurement of the life distribution shape parameter. For this reason, it is good to have a shape parameter estimate (lognormal σ or Weibull β) based on actual failures in one high stress cell, if possible.

FIGURE 9.6 Degradation data extrapolation to failure times (measurement readings at successive readouts for each unit, with fitted lines extended to the failure limit to give projected failure times).

9.2.2  Method 2 First we place the additional restriction that G(Q0) be zero: that is the same as saying the initial value I is zero and only the amount of degradation over time is important, not a unit’s starting point. Typically, this assumption is valid since very common functions for G(Qt) are percent change or absolute change from the initial time zero value. Call D the failure level value for G(Qt) that signals a unit no longer meets functional requirements. D is the “distance” to go until failure is reached. The model now is of the form D = R × tf , or, distance to failure equals rate of degradation multiplied by time to failure. Solving for time to failure, tf = D/R(S). Consequently, time to failure is proportional to the reciprocal of the stress dependent slope. The lognormal life distribution is a common model for degradation failures (see Section 5.8), but the reciprocal of a lognormal distribution is also lognormal (with the same σ). This property means that the random degradation slopes of units operating at the same temperature will have a lognormal distribution. In other words, each unit can be considered to have its own degradation slope chosen from a population of slopes following a lognormal model. The shape parameter for R(T) is the same as for tf, and the median R50(T) is related to T50 by R50(T) = D/T50. If we now assume T50 varies with temperature according to an Arrhenius relationship, we can write

R50 = A e^(−ΔH/kT)

where A is a constant term, replacing D and k = 8.617 × 10−5 eV/°K (Boltzmann’s constant). Temperatures are expressed as degrees Kelvin or 273.15 + °C. In order to use the Arrhenius equation to estimate ΔH, we need estimates of R50(T) for each temperature cell. In the cell with temperature T1, we calculate the set of slopes {G(Qtmax)/tmax}, one for each component, and treat this set as if it were a sample of lognormal


observations. Here, tmax is the last readout time in the cell and G(Qtmax) will generally be the largest amount of degradation observed for each component (although there may be exceptions to this rule in low-stress cells). By dividing the largest degradation amount by the largest readout time, the experimental or observational error should be significantly reduced, leaving a good slope observation for that component. Earlier readout G(Qt) measurements may have errors that are much more significant when compared to the amount of degradation, and so we are better off not using them except to check that they are consistent with the model we fit. To get Rˆ 50 (T1 ) for a temperature cell with n units on test, we analyze the sample of n slopes using standard lognormal methods. Since the sample is complete, the simplest procedure would be to compute the sample mean and standard deviation of the natural logarithms of the slopes. The sample mean is ln Rˆ 50 (T1 ) and the standard deviation s is an estimate of the life distribution σ (which is equal to the degradation slope σ). If there are J temperature cells, and we let y1 = ln Rˆ 50 (T1 ), y 2 = Rˆ 50 (T2 ), and so on with



y j = ln Rˆ 50 (Tj )

then

y j = a + bx j + ε j

where a = ln A, b = −ΔH, xj = 1/(kTj) and the εj’s are random, independent errors that we will also assume are identically distributed (which should be approximately true if all the stress cells have the same number of test components). Standard regression programs can be used to estimate a and b (with confidence bounds, if we examine the εj and it is reasonable to assume they are normally distributed). A use Tˆ50 can then be projected by dividing D (the “distance” to fail) by Rˆ 50 (Tuse ). As was the case with method 1, a use shape parameter should be estimated from actual failure data. Weighted regression, using the square root of the cell sample sizes, should be used if the cells have different numbers of components. Note from the previous equation that the estimate −bˆ is also the estimate of ΔH, and therefore it can be used to calculate the acceleration factor between any two temperatures using the formulas described in Chapter 7. If only two temperature cells are run, and each cell has the same number of test units and readouts, then the estimate of ΔH is

ΔH = k(y1 − y2)[(1/T2) − (1/T1)]^(−1)

where y1 is the average of all the ln R(T1) estimates from the cell at temperature T1, and y2 is the average of all the ln R(T2) estimates from the cell at temperature T2. More complicated acceleration models like an Eyring for several stresses are handled the same way. The median cell slope estimates measuring the rate of degradation are set equal to the reciprocal of the acceleration model equation. This expression for R(t) can usually be transformed into a linear form by taking logarithms and changing variables. A program for multiple regression can then be used to solve the resulting equations.


Using degradation data gives us many data points, even from small numbers of units on test. We can also include stress cells that are close to use conditions, as long as the amount of parameter drift we are measuring stands out from the instrument measurement error. We are protected from misjudging the proper sample sizes and stress levels needed to obtain adequate failures. These are several very desirable properties of degradation or drift modeling. The disadvantage of using this data to model acceleration is that it takes us one step further away from reality when we deal with parameter drift instead of actual failures. What do we do about units that do not appear to drift at all? What about those that degrade and then seem to improve or recover? All these situations present both mathematical and conceptual difficulties that can cast doubt on the analysis validity. Our recommendation is to use drift or degradation analysis only when the drift mechanism is understood and relates directly to actual failures. Even then, plan to have at least one cell with high enough stress to produce actual failures. This cell can be used to test the validity of the T50 projections obtained from degradation modeling, and to estimate the distribution shape parameter. EXAMPLE 9.3  DEGRADATION DATA ANALYSIS A resistor used in a power supply is known to drift over time, degrading eventually to failure. Increasing the operating temperature speeds up resistor degradation. Once a change of 30% is reached, the power supply can no longer function. Preliminary studies indicate the Arrhenius model applies. Two stress cells, each containing 10 resistors, are run at 105°C and 125°C. Percent change in resistance from time zero is measured for each unit at 24-, 96-, and 168-hour readouts. The results are given in Table 9.2. Use both method 1 and method 2 to estimate ΔH and project an average failure rate over 100 K h of field life at a use temperature of 30°C.

Solution Following method 1, least squares was used to fit straight lines through both the origin and the three degradation readings for each resistor on test. Next, the time where each line crossed the 30% failure point was calculated. The lines and the projected failure times for each resistor in the 105°C are shown in Figure 9.7. The same information for the units in the 125°C cell is shown in Figure 9.8.

TABLE 9.2
Degradation Data

              105°C Cell % Degradation        125°C Cell % Degradation
Component     24 hours  96 hours  168 hours   24 hours  96 hours  168 hours
1             1.0       6.0       11.1        7.1       27.1      46.8
2             0         0.4       1.3         0.6       3.6       5.3
3             1.1       4.5       6.4         3.0       12.9      21.7
4             0         1.7       2.7         2.5       11.6      18.5
5             1.5       5.4       9.4         0.3       3.1       7.5
6             0.8       4.3       6.4         0.9       3.1       5.4
7             0.9       2.6       3.9         2.8       11.2      21.2
8             0         3.0       5.5         1.8       10.8      18.2
9             2.0       6.0       10.1        5.3       22.0      38.4
10            0.6       0.7       1.0         3.5       13.4      22.4
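As a cross-check we have added on the method 1 lines plotted in Figures 9.7 and 9.8, the through-the-origin least squares slope for a unit and its extrapolation to the 30% limit can be computed directly. The short sketch below is our own convenience code, not part of the original analysis; applied to the first 105°C resistor in Table 9.2 it reproduces the projected failure time of about 463 hours.

```python
# Sketch: method 1 projected failure times from the Table 9.2 readings.
import numpy as np

times = np.array([24.0, 96.0, 168.0])            # readout hours
fail_level = 30.0                                # percent degradation defining failure

def projected_fail_time(readings):
    """Least squares slope through the origin, extrapolated to the 30% line."""
    y = np.asarray(readings, float)
    slope = np.sum(times * y) / np.sum(times ** 2)
    return fail_level / slope

# First resistor in the 105C cell (Table 9.2): 1.0, 6.0, 11.1 percent
print(projected_fail_time([1.0, 6.0, 11.1]))     # about 463 hours
```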


FIGURE 9.7 Projected degradation failure times, 105°C (percentage degradation versus hours for each resistor, with lines crossing the 30% failure limit at approximately 463, 491, 534, 744, 757, 841, 1230, and 1800 hours; two lines not shown reached 30% at 4,441 and 11,590 hours).

FIGURE 9.8 Projected degradation failure times, 125°C (lines crossing the 30% failure limit at approximately 107, 131, 200, 220, 240, 260, 274, and 721 hours; two lines not shown reached 30% at 912 and 930 hours).

The T50 estimates, using the projected failure times given in Figures 9.7 and 9.8, are 1115 hours and 309 hours for the 105°C and 125°C cells, respectively. A pooled overall σ estimate is 0.81, obtained using standard methods for normal data (working with the ln tf from each cell). The estimate of ΔH is calculated from



H = k ln

T 501 T 50 2

1 1 − T1 T 2

−1

= .83

and the acceleration factor for 125°C to 30°C is 2009. Finally, AFR (100 K) is estimated to be 119 FITs.


For a method 2 slope analysis, we start by dividing all the 168-hour readings by 168. This gives us ten slope estimates from each temperature cell. For the 105°C cell the data are 0.06607, 0.00774, 0.0381, 0.01607, 0.05595, 0.0381, 0.02321, 0.03274, 0.06012, and 0.006. For the 125°C, we have: 0.2786, 0.03155, 0.1292, 0.1101, 0.04464, 0.03214, 0.1262, 0.1083, 0.2286, and 0.1333. Analyzing these two sets of data as lognormal complete samples gives an Rˆ50 for 105°C of 0.0267 and an Rˆ50 for 125°C of 0.0971. The pooled σ estimate is 0.80. The ΔH estimate is obtained by letting y1 = ln 0.0267 and y2 = ln 0.0971 and using the equation for ΔH given in the explanation of method 2. The result is ΔH = 0.84 and the acceleration from 125°C to 30°C is 2122. The Tˆ50 estimate for the 125°C cell is just D/Rˆ50 or 30/0.0971 = 309 (the same as with method 1). The use AFR at 100 K h is estimated to be 95 FIT’s.
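The method 2 arithmetic is compact enough to verify in a few lines. The sketch below is a cross-check we added rather than part of the original solution; it uses the slope estimates quoted above and reproduces ΔH of about 0.84 along with the 125°C-to-30°C acceleration factor of roughly 2100.

```python
# Sketch: method 2 estimate of deltaH from the two cells of degradation slopes.
import numpy as np

slopes_105 = [0.06607, 0.00774, 0.0381, 0.01607, 0.05595,
              0.0381, 0.02321, 0.03274, 0.06012, 0.006]
slopes_125 = [0.2786, 0.03155, 0.1292, 0.1101, 0.04464,
              0.03214, 0.1262, 0.1083, 0.2286, 0.1333]

k = 8.617e-5                                   # Boltzmann's constant, eV/K
T1, T2 = 105.0 + 273.15, 125.0 + 273.15        # cell temperatures in Kelvin

y1 = np.mean(np.log(slopes_105))               # ln R50 estimate, 105C cell
y2 = np.mean(np.log(slopes_125))               # ln R50 estimate, 125C cell
dH = k * (y1 - y2) / (1.0 / T2 - 1.0 / T1)
print("R50 at 105C:", np.exp(y1), "R50 at 125C:", np.exp(y2))
print("deltaH:", dH)                           # about 0.84 eV

T_use = 30.0 + 273.15
af = np.exp(dH / k * (1.0 / T_use - 1.0 / T2))
print("acceleration, 125C to 30C:", af)        # roughly 2100
```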

After studying the two proposed methods and Example 9.3, some obvious questions arise. Which method is better? And if method 2 is a good way to analyze degradation data, why bother to take any intermediate degradation readouts? If no intermediate readouts are made, both methods will give the same answers. However, the intermediate readouts allow one to visually examine the key assumption of linearity. So these readouts should be included in at least one cell, even if the simpler method 2 is chosen for the final analysis. As to which method is “better,” there is no clear answer. Both methods are “quick and dirty” in the sense that they tend to ignore the experimental error in the observations made at each readout. And both methods, as shown in Example 9.3, will give nearly the same results. The important thing to focus on is that either method allows us to calculate an estimate for ΔH and project use failure rates from an experiment where one cell had no actual failures and the other cell had only two failures. Nelson (1990) has an extensive chapter on “Accelerated Degradation” that describes many of the models that have been used in the literature. Meeker and Escobar (1998) also have a chapter on “Degradation Modeling.”

9.3  Lifetime Regression Models When we replace the log of the time to fail (or a log of a scale parameter such as T50 for the lognormal or α for the Weibull) by a linear regression model with unknown parameters, we are doing lifetime regression. We assume the shape parameter of the distribution (σ or β) does not change but the time to reach any given percentile of failures depends on the regression model. All the acceleration models discussed in Chapter 7 were examples of lifetime regression models. In this section, we show how to use lifetime regression to analyze reliability data when categorical effects may also influence lifetime. This is a very important analysis tool, as it allows us to test whether things like vintage, vendor, plant of manufacture, or the presence or absence of some other factor have a significant effect on product lifetime. The analogous tool used to answer these kinds of questions in ordinary statistical analysis, where we are not analyzing censored time to failure data, is the analysis of variance (ANOVA). The software programs we have previously used to estimate acceleration model parameters (JMP, Minitab, Excel) can all estimate the effect of categorical factors on lifetime and test whether these effects are significant. The user has to know how to input those factors on the spreadsheet and how to calculate the appropriate LR tests for significance. We will show how to do this with an example based on a multimillion-dollar decision a company had to make after analyzing field reliability data. While the data in this example are not the original (proprietary) data, the situation and results accurately reflect what actually took place.


EXAMPLE 9.4  LIFETIME REGRESSION USED TO ESTIMATE THE RELIABILITY EFFECTS OF VINTAGE AND PLANT OF MANUFACTURE AND THEIR SIGNIFICANCE A company discovered it was receiving many field failures due to the same component. Machines containing 100 of these components had been shipped every month for over a year. The components were made at two different plants (A and B) and had two process vintages (I and II). A first look at the field returns showed that out of 17 failures, 11 came from Plant B product, but Plant B had only half as many components in the field as Plant A (1800 vs. 3600). When it became obvious that many components made in Plant B were failing, shipments using Plant B product were stopped. The management question under debate was whether to scrap a large remaining supply of Plant B components. After an inventory investigation it was known that:

1. Plant B shipments were predominantly of Process I (80%) product, while Plant A shipments were predominantly of Process II (80%) product.
2. Most of the failures were Process I parts (14/17)—but perhaps that was because so many of these parts were made in Plant B.
3. Plant B machines started shipping several months before Plant A machines started shipping—so they were the oldest field parts.

The question to decide from an analysis of the field data was whether there were significant plant and/or process effects that would lead to a "fact"-based decision about scrapping or not scrapping millions of dollars of inventory from Plant B. This is the type of question lifetime regression models can answer using nominal columns to label plant and process types corresponding to data rows. The data and assumptions follow:





1. Plant A machines shipped 3 per month at the start of each month for the last year—that amounted to 36 machines with 720 Process I components and 2880 Process II components.
2. Plant B machines started shipping 2 machines at the start of each month, 15 months ago. Shipments were stopped after 9 months. That amounted to 1440 Process I components and 360 Process II components. The shipment and failure data are summarized by plant and process in Table 9.3. A third assumption was also used:
3. Previous lab testing and field data for the failing components indicated a Weibull model was a good fit for the component life distribution. While the scale parameter α sometimes varied from lot to lot, experience showed that the shape parameter β remained more or less constant.

Exact failure times were known, along with the corresponding plant and process information. Censoring times are based on the current ages of the machines in the field. The JMP data table given in Figure 9.9 shows all the censoring times and all the exact times of failure for both Plant A and Plant B product (1 in the Censor column indicates a right censoring

TABLE 9.3
Summary of Shipment and Failure Data

                      Total Components Shipped   Process I   Process II   Failures by Plant
Plant A               3600                       2880        720          6
Plant B               1800                       1440        360          11
Failures by process                              14          3            17


FIGURE 9.9 JMP spreadsheet for the plant/process field reliability data.


time and 0 indicates a failure time). The Count column gives the corresponding number of censored or failed components. The columns labeled “Plant” and “Process” are nominal columns that indicate which plant (A or B) and which process (I or II) a particular data row comes from.

Solution Once we have entered all the data as shown in Figure 9.9, it is easy to do the analysis. (The JMP data table can be downloaded as Example9.4.jmp.) Figure 9.10 shows how to enter the Fit Parametric Survival Model Specification screen. After clicking on Go, the results are shown in Figure 9.11. The bottom of the JMP analysis output in Figure 9.11 shows the Effects LR Tests of significance for the factors “Plant” and “Process.” “Plant” has a “p value” or, as JMP calls it, a “Prob > Chi Sq” value of 0.7013. We look for p values less than .05 for 95% significance (less than .01 for 99% significance). The p value for “Plant” is not even close to being significant, implying there is no statistical support for considering Plant B product different from Plant A product. The result is different, however, when we look at the Effects LR Test for “Process.” The p value is .011, which is nearly 99% statistically significant. “Process” clearly makes a difference as far as field reliability is concerned. In order to compute Weibull α parameter values for the various process and plant combinations, we need to know what JMP means by the estimates labeled “intercept,” “Plant[A],” and “Process[I].” When JMP fits a model with categorical factors, it puts the average effect for a factor into the “intercept” term and scales the effects of the different levels of the factor so that they add up to zero. For a factor with only two levels, as we have here for both “Plant” and “Process,” JMP estimates one of the levels and the estimate of the other level is the negative of the one estimate shown. When there are three or more levels, estimates are shown for all but one of the levels and the missing level estimate is the negative of the sum of the other level estimates. The fitted model displayed in Figure 9.11 is

FIGURE 9.10 Fit parametric survival screen for plant/process reliability data.


Estimates for ln α:
Plant A, Process I: 10.434 − 0.031 − 0.237 = 10.166
Plant B, Process I: 10.434 + 0.031 − 0.237 = 10.228
Plant A, Process II: 10.434 − 0.031 + 0.237 = 10.640
Plant B, Process II: 10.434 + 0.031 + 0.237 = 10.702

FIGURE 9.11 JMP analysis results for plant/process reliability data. (Annotations on the output: Plant differences are not even close to being statistically significant; Process differences are very significant, with Process I worse. Conclusion: scrap all Process I parts but use everything else from either plant.)

ln α_ij = 10.434 + Plant[i] + Process[j]

Plant[1] = Plant[A] = −0.031,  Plant[2] = Plant[B] = +0.031
Process[1] = Process[I] = −0.237,  Process[2] = Process[II] = +0.237

The four estimates of α shown in Figure 9.11 come from this equation and cover every combination of plant and process in the field. The single β estimate is 3.613 (=1/δ). The value for −2 log likelihood, after fitting the model, is shown by JMP to be 484.77. The spreadsheet for doing the same analysis using Minitab is partially shown in Figure 9.12. The Minitab analysis platform to use, found in the pull-down Reliability/Survival menu, is called Regression with Life Data. The entries are shown in Figure 9.13. Not shown is the screen obtained from the Censor button that is used to enter the Censor Code column and in which we indicate the censoring value as 1. Clicking OK gets the analysis printout in Figure 9.14. At first, almost all the results seem to disagree with those obtained from JMP. That is because Minitab uses a different form of model and parameter labeling. Its "Intercept" term is the ln α value for operation at the factor levels not shown, for Plant A, Process I parts, in this case. The estimates "Plant B" and "Process II," when added to the intercept term, change the ln α value to correspond to Plant B and/or Process II product. Interpreting Figure 9.14 this way produces α values for the product types that correspond closely with JMP. Minitab uses an approximate normal test procedure based on the Wald statistic to calculate p values that assess factor significance, so the numbers are somewhat different from the LR chi-square numbers obtained by JMP. However, the conclusions reached are the same: plant is not significant but process is significant at greater than the 95% significance level. One last point of difference between Minitab and JMP should be mentioned. Minitab shows the minimized negative log likelihood for the fitted model to be 242.386 while JMP shows this value multiplied by 2.
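The relationship between the two parameterizations is easy to verify by hand. The short Python sketch below (our own illustration, not part of the book's JMP/Minitab/Excel workflow; the variable names are ours) reconstructs the four ln α estimates both ways and confirms they agree:

import math

# JMP-style sum-to-zero coding (estimates from Figure 9.11)
intercept, plantA, processI = 10.434, -0.031, -0.237
jmp = {
    ("A", "I"):  intercept + plantA + processI,
    ("B", "I"):  intercept - plantA + processI,
    ("A", "II"): intercept + plantA - processI,
    ("B", "II"): intercept - plantA - processI,
}

# Minitab-style reference-cell coding: the intercept is the Plant A/Process I value,
# and the "Plant B" and "Process II" coefficients are offsets from that baseline
mt_int = intercept + plantA + processI              # 10.166
mt_plantB, mt_processII = -2 * plantA, -2 * processI
minitab = {
    ("A", "I"):  mt_int,
    ("B", "I"):  mt_int + mt_plantB,
    ("A", "II"): mt_int + mt_processII,
    ("B", "II"): mt_int + mt_plantB + mt_processII,
}

for cell in jmp:
    # identical ln(alpha) values, hence identical Weibull characteristic lives
    print(cell, round(jmp[cell], 3), round(minitab[cell], 3), round(math.exp(jmp[cell])))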


FIGURE 9.12 Minitab spreadsheet for the plant/process field reliability data.

FIGURE 9.13 Minitab regression with life data screen for plant/process data.


Regression with Life Data: Times versus Plant, Process

Plant differences are not even close to being statistically significant! Process differences are very significant (with Process I worse)

FIGURE 9.14 Minitab regression with life data output for plant/process data.

Surprisingly, we can also carry out this sophisticated lifetime regression analysis using an Excel spreadsheet. It takes some effort, but the exercise is worthwhile, especially for those without a good statistical software package handy. The procedure is shown in Appendix 9.2A. NOTE: Readers familiar with design of experiments models might ask why a term for "interaction of plant and process" was not included in the model. An easy way to check whether this term is needed would be to construct a plant/process interaction plot using the ln α estimates from Figure 9.11. The resulting almost parallel lines would indicate there is no need for an interaction term. EXERCISE 9.4 Vendors A and B supply lots of the same component used by Company C. Some recent field data suggest that the parts supplied by Vendor B may not be as reliable as the parts supplied by Vendor A. A reliability engineer for Company C is asked to stress test samples from both vendors and tell management whether there is a statistically significant difference in reliability. His plan was to test samples of 30 parts from each vendor for 150 hours (a time sufficient to get reasonable numbers of failures at stress) and perform a statistical test using the numbers of failures seen from each vendor (the Fisher exact test described in Chapter 11, Section 11.9.2, would work well here). He finishes testing 150 hours for Vendor A components but has completed only 100 hours of testing for Vendor B's components when management has to make a final decision on whether to use Vendor B parts and asks for his conclusions. He has exact time of failure data for each vendor. Vendor A had 13 failures at 15.1, 30.3, 38.4, 63.4, 71.7, 73.7, 84.7, 90.8, 92.9, 98.4, 100.9, 127.9, and 139 hours. The remaining 17 components had not failed at 150 hours. Vendor B had 16 failures at 25.4, 27.2, 28.4, 30.8, 31.9, 33.3, 34.1, 53.6, 55.1, 61.6, 64.6, 65.1, 65.9, 68.7, 93.4, and 97.8 hours. The remaining 14 components had not failed at 100 hours.


Based on past history, a lognormal life distribution is a satisfactory model. Assuming both vendors have the same shape parameter (which can be tested using methods described in Chapter 8), use lifetime regression to decide whether there is a significant reliability difference (at the 95% level) between the two vendors.
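One way to carry out the lifetime regression that this exercise asks for is sketched below in Python (our own illustration with hypothetical variable names, not the book's JMP, Minitab, or Excel templates). The model is a lognormal with a common σ and ln T50 = a + b·(Vendor B indicator); comparing the fits with and without the vendor term gives the LR test.

import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm, chi2

vendor_a_fails = [15.1, 30.3, 38.4, 63.4, 71.7, 73.7, 84.7, 90.8, 92.9, 98.4, 100.9, 127.9, 139]
vendor_b_fails = [25.4, 27.2, 28.4, 30.8, 31.9, 33.3, 34.1, 53.6, 55.1, 61.6, 64.6, 65.1, 65.9, 68.7, 93.4, 97.8]
# (time, censor flag, Vendor B indicator): exact failures plus the unfailed survivors
data = ([(t, 0, 0) for t in vendor_a_fails] + [(150.0, 1, 0)] * 17 +
        [(t, 0, 1) for t in vendor_b_fails] + [(100.0, 1, 1)] * 14)
t = np.array([d[0] for d in data]); cens = np.array([d[1] for d in data]); vb = np.array([d[2] for d in data])

def neg_log_lik(theta, use_vendor_term=True):
    a = theta[0]
    b = theta[1] if use_vendor_term else 0.0
    sigma = abs(theta[-1])
    mu = a + b * vb                                   # ln(T50) for each unit's vendor
    z = (np.log(t) - mu) / sigma
    ll = np.where(cens == 0,
                  norm.logpdf(z) - np.log(sigma * t),  # exact failure: lognormal log PDF
                  norm.logsf(z))                        # right censored: lognormal log survival
    return -ll.sum()

full = minimize(neg_log_lik, x0=[4.5, 0.0, 1.0], method="Nelder-Mead")
reduced = minimize(lambda th: neg_log_lik(th, use_vendor_term=False), x0=[4.5, 1.0], method="Nelder-Mead")
lam = 2 * (reduced.fun - full.fun)                     # LR statistic for "no vendor effect"
print("LR statistic:", lam, " p-value:", chi2.sf(lam, df=1))   # a vendor difference at 95% needs p < .05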

9.4  The Proportional Hazards Model The proportional hazards model, proposed by Cox (1972), has been used primarily in medical testing analysis, to model the effect of secondary variables on survival. It is more like an acceleration model than a specific life distribution model, and its strength lies in its ability to model and test many inferences about survival without making any specific assumptions about the form of the life distribution model. This section will give only a brief description of the proportional hazards model, since it has limited engineering applications. 9.4.1  Proportional Hazards Model Assumption Let z = {x, y, …} be a vector of 1 or more explanatory variables believed to affect lifetime. These variables may be continuous (like temperature in engineering studies, or the dosage level of a particular drug in medical studies) or they may be indicator variables with the value 1 if a given factor or condition is present, and 0 otherwise. Let a nominal (or baseline) set of these variables be given by z0 = (x0, y0, …) and let the hazard rate when the variables have the value z0 be h0(t), where h0(t) is a legitimate hazard function (failure rate) for some unspecified life distribution model. The proportional hazards model assumes we can write the changed hazard function for a new value of z as

h_z(t) = g(z) h_0(t)

In other words, changing z, the explanatory variable vector, results in a new hazard function that is proportional to the nominal hazard function, and the proportionality constant is a function of z, g(z), independent of the time variable t. A common and useful form for g(z) is the log-linear model, which has the equation g(x) = e^{ax} for one variable, g(x, y) = e^{ax+by} for two variables, and so on. 9.4.2  Properties and Applications of the Proportional Hazards Model The proportional hazards model is equivalent to the acceleration factor concept if and only if the life distribution model is a Weibull (which includes the exponential model as a special case). For a Weibull with shape parameter β and an acceleration factor A between a nominal use fail time t0 and a high stress fail time ts (with t0 = A·ts), we have g(s) = A^β. In other words, h_s(t) = A^β h_0(t). Under a log-linear model assumption for g(z), without any further assumptions about the life distribution model, it is possible to analyze experimental data and compute MLEs and use LR tests to determine which explanatory variables are highly significant.
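To see why the Weibull family is the one for which the proportional hazards and acceleration viewpoints coincide, a short derivation (added here for clarity; it uses only the Weibull hazard function introduced in earlier chapters) helps. With characteristic life α_0 at use conditions and α_s = α_0/A at stress,

h_0(t) = \frac{\beta}{\alpha_0}\left(\frac{t}{\alpha_0}\right)^{\beta-1}
\quad\Longrightarrow\quad
h_s(t) = \frac{\beta}{\alpha_0/A}\left(\frac{t}{\alpha_0/A}\right)^{\beta-1}
       = A^{\beta}\,\frac{\beta}{\alpha_0}\left(\frac{t}{\alpha_0}\right)^{\beta-1}
       = A^{\beta} h_0(t)

so g = A^β does not depend on t, which is exactly the proportional hazards form.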


In order to do this kind of analysis, however, special software is needed. More details on the theory and applications of the proportional hazards model may be found in Cox and Oakes (1984).

9.5  Defect Subpopulation Models This section describes a very useful and perhaps underused model that can match the failure rate characteristics of components and systems often encountered in practice. Suppose a small proportion of a population of components has the property of being highly susceptible to a certain failure mechanism. Consider these units to be manufacturing defects in a reliability sense—working originally and not detectable as damaged by standard end-of-the-line tests and inspections; yet, after a short period of use, they fail. The rest of the components in the population are either not susceptible to this failure mechanism, or else they fail much later in time for possibly some other cause. A typical example might be a failure mechanism that only occurs when there are traces of a certain kind of contaminant left on critical spots within the component. Another example might be when a small proportion of hermetically sealed units have entrapped moisture that leads to early corrosion failures. Occasionally, field or life test data indicates this bimodal behavior by showing an early incidence of failures that seem to slow down to almost nothing long before a significant proportion of the population has failed. A probability plot of this kind of data begins by following a straight line, then appears to bend over. A good model for this kind of behavior is the "defect model." The population of components is really a mixture of two populations: a small reliability-defective subpopulation that fails early and the rest of the components that will fail due to wear-out mechanisms that only are seen much further out in time (designated as "immortals" as far as the current test or field data is concerned). Let p be the susceptible fraction of the total population and let F_d(t) be the CDF that applies to that susceptible subpopulation. The total CDF, F(t), for the entire population (up until the long-term wear-out modes become visible) is

F(t) = p F_d(t)

This model, where a (hopefully small!) proportion of the population follows an early life failure model, has a wide applicability (see, for example, Trindade 1991). Some authors describe these early failures as “infant mortality,” a term that goes back to the actuarial origins of the hazard rate. The primary question is how to detect and analyze data with failures from a reliability defective subpopulation. For example, if we test 100 units for 1000 hours and have 30 failures by 500 hours, and no more by the end of test, are we dealing with two populations or just censored data? If we continue the test, will we see only a few more failures, because we have “used up” the bad ones, or will the other 70 fail following the same life distribution model? The easiest way to spot that data contains a defect subpopulation is by graphical analysis. Assume, for example, that the failure mode is one typically modeled by a lognormal distribution. If we plot the cumulative failures on lognormal graph paper and instead of


following a straight line the plot seems to curve away from the cumulative percent axis, it is a signal that a defect subpopulation might be present. If we run the test long enough and a defective subpopulation is present in the sample, we would expect the plot to bend over asymptotic to a cumulative percent line that matches the proportion of defectives in that sample. Thus, the clues are a plot with points that bend and a slowing down or a complete stop of new failures, even though many units are still on test. In addition, a physical reason to expect reliability defects, often based on failure analysis, is highly desirable. Returning to our “30 failures out of 100 on test in the first 500 hours and no more failures for the next 500 hours” example, consider what would happen if we replot the data, this time assuming only 30 units were on test. If these were truly the complete reliability defective subpopulation of units on test, and the lognormal model is appropriate for their failure mode, then the new plot should no longer bend. The T50 and sigma estimates obtained from the best fitting line (or an MLE program) would be the proper parameter estimates for the Fd(t) distribution of the defective subpopulation. The estimate of p, the proportion defective, would be .3. We would run into trouble, however, if there were a large number of defective units left on test that had not had enough time to fail. The original plot, with unadjusted sample size, would have less curvature and it might not be clear where the asymptote is. If we worked by trial and error, adjusting the number on test from the full sample size down to the actual number of failures, we might be able to pick one plot where the points line up best. This fitting can be done by computer, using an iterative least squares fit and choosing a sample size that yields the smallest least squares error. The maximum likelihood approach will be described after a simple example using the graphical least squares approach is given. One clear advantage of the maximum likelihood approach, in addition to its superiority as a parameter estimation method, is that it can be used to statistically test the hypothesis that there actually is a defective subpopulation. EXAMPLE 9.5  DEFECT MODEL A certain type of semiconductor module has a metal migration failure mechanism that is greatly enhanced by the presence of moisture. For that reason, modules are hermetically sealed. It is known that a small fraction will have moisture trapped within the seal. These units will fail early and it is desired to fit a suitable life distribution model to these reliability defects. Test parts are made in such a way as to greatly increase the chance of enclosing moisture in a manner typical of the normal manufacturing process defects. 100 of these parts were randomly selected and put on life test for 2000 hours. There were 15 failures. The failure times were: 597, 623, 776, 871, 914, 917, 1021, 1117, 1170, 1182, 1396, 1430, 1565, 1633, and 1664. Estimate the fraction defective in the sample and the life distribution parameters for this subpopulation, assuming a lognormal model applies. Use an iterative least squares method.

Solution Figure 9.15 shows a lognormal plot of the fail times, using approximate median rank plotting positions 100(i − 0.3)/100.4 in percent for the ith fail. The plot does appear to have curvature, although exactly where it bends over is not readily apparent. By iterative least squares trials, changing n from 100 down to 15, a "best" fit occurs at n = 18. This plot is shown in Figure 9.16. Here the points line up well. The graphical T50 and sigma estimates are 1208 and 0.43. (These results are quite good since the data were simulated from 20 defects with a T50 of 1200 and a sigma of 0.45.) The best fit fraction defective estimate of 18/100 is also close to the "true" value of 0.20. Solving for the parameters in the defect model using the method of maximum likelihood turns out to be a simple extension of the maximum likelihood theory described in Chapter 4. In that chapter, maximum likelihood equations were given for various types of censored data. The basic


FIGURE 9.15 Lognormal probability plot (cumulative percent failures versus time) of 15 out of 100.

FIGURE 9.16 Lognormal probability plot (cumulative percent failures versus time) of 15 out of 18.


building blocks of the equations were the PDF f(t) and the CDF F(t). If, however, only a fraction p of the population is susceptible to the failure mechanism modeled by F(t), then pF(t) is the probability a randomly chosen component fails by time t. Similarly, the "likelihood" of a randomly chosen component failing at the exact instant t becomes pf(t). The rule for writing likelihood equations for the defect model is to substitute pf and pF wherever f and F appear in the standard likelihood equation. To briefly review MLE theory: The standard likelihood equation for Type I censored data (n on test, r fails at exact times t1, t2, ..., tr, and (n − r) units unfailed at time T = the end of test) is given by

LIK = \prod_{i=1}^{r} f(t_i)\,[1 - F(T)]^{n-r}

If we suspect only a fraction p of the population is susceptible to failure, the (defect model) maximum likelihood equation becomes

LIK = p^r \prod_{i=1}^{r} f(t_i)\,[1 - pF(T)]^{n-r}

Maximum likelihood estimates are defined as the values of p and the population parameters that maximize LIK. These are the same values that minimize L = −log LIK, an easier equation to work with. If MLEs have been calculated for a suspected defect model, it only takes a little more work to test the hypothesis p = 1 (i.e., there is no defect subpopulation) versus the alternative that only a fraction of the population will actually fail. Let L1 be the minimized negative log likelihood for the standard (nondefect) model and let L2 be the minimized negative log likelihood for the defect model. The LR test statistic is λ = 2(L1 − L2). If the hypothesis p = 1 is true, λ will have approximately a chi-square distribution with 1 degree of freedom. If λ is larger than, say, χ²_{1;.95} = 3.84, then we reject the standard model and accept the defect model at the 95% confidence level. EXERCISE 9.5 Derive the defect model likelihood equation for interval (or readout) data where the readout times are T1, T2, ..., Tk = the end of test. Assume r1 fails are discovered at time T1, r2 at T2, and so on until the end of test. There are r total fails observed and n − r units removed at the end of test.
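As a cross-check of the Type I censored defect-model likelihood just written down, here is a minimal Python/SciPy sketch (our own illustration, separate from the book's JMP and Excel templates) applied to the Example 9.5 data; as with the JMP and Excel solutions discussed below, good starting values may be needed.

import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm, chi2

fails = np.array([597, 623, 776, 871, 914, 917, 1021, 1117, 1170, 1182,
                  1396, 1430, 1565, 1633, 1664.0])
n, r, T = 100, len(fails), 2000.0          # Example 9.5: 15 failures, test ends at 2000 hours

def neg_log_lik(theta, fixed_p=None):
    mu, sigma = theta[0], abs(theta[1])
    p = fixed_p if fixed_p is not None else theta[2]
    log_f = norm.logpdf((np.log(fails) - mu) / sigma) - np.log(sigma * fails)  # lognormal log PDF
    F_T = norm.cdf((np.log(T) - mu) / sigma)                                   # lognormal CDF at T
    # L = -log LIK for the defect model: p^r * prod f(t_i) * [1 - p F(T)]^(n - r)
    return -(r * np.log(p) + log_f.sum() + (n - r) * np.log(1.0 - p * F_T))

defect = minimize(neg_log_lik, x0=[8.0, 1.0, 0.5], method="Nelder-Mead")
standard = minimize(lambda th: neg_log_lik(th, fixed_p=1.0), x0=[8.0, 1.0], method="Nelder-Mead")
lam = 2 * (standard.fun - defect.fun)                 # LR statistic for H0: p = 1
print(defect.x, defect.fun)                           # roughly mu = 7.01, sigma = 0.35, p = 0.16
print("lambda =", lam, "chi-square p-value =", chi2.sf(lam, df=1))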

EXAMPLE 9.6  MAXIMUM LIKELIHOOD ESTIMATION FOR THE DEFECT MODEL Calculate the MLEs for the data given in Example 9.5, assuming a defect model applies. Use the LR ratio test to see whether the data supports the existence of a susceptible subpopulation.

Solution Unfortunately, despite the fact that defective subpopulations are quite common in component data, no current, widely available software package includes maximum likelihood estimation routines and LR tests for the defect model. However, future versions of JMP (i.e., beyond JMP 8) will include this capability. Currently, JMP, with its Formula Editor and Non-Linear Analysis Platform, can easily be programmed to estimate parameters for defective subpopulations and calculate the L values needed to test the hypothesis that p = 1, that is, that all components are part of the same population (no defective subpopulation).


FIGURE 9.17 JMP data table for defect model analysis of the Example 9.5 data.

In JMP's Nonlinear Regression Platform, one can define a special loss function to use instead of least-squares. When a custom loss function is defined, the nonlinear platform minimizes the sum of the loss function across rows. In our case, we create a negative log-likelihood column using standard likelihood functions but replacing f(t) and F(t) with pf(t) and pF(t). Next, by specifying the loss as the negative of the log likelihood, an application of the Likelihood Loss option in the nonlinear platform calculates the desired estimates and log-likelihood values. As an alternative, we will also show how slight modifications to the general Excel spreadsheets for Weibull and lognormal maximum likelihood estimation can turn these spreadsheets into defect model maximum likelihood estimation templates. However, for both approaches, good starting parameter guesses are often needed and it may take some trial and error. We start with a JMP data table that is set up for lognormal maximum likelihood estimation using the Nonlinear Analysis Platform. This table is shown in Figure 9.17 with the data from Example 9.5 entered. The formula for the negative log-likelihood column is shown in Section 9.3A of the appendix to this chapter. The JMP table can be downloaded as Ex9.6.jmp. In the JMP menu, we choose Analyze, Modeling, and Nonlinear. Figure 9.18 shows the Nonlinear Analysis entry screen, where we specify the negative log-likelihood column as the Loss entry column. The platform will minimize the sum of the negative log-likelihood column. The starting values for the parameters (entered when the formula was created for the negative log-likelihood column) are shown to be: mu = 8, sig = 1 and p = .5. These were guesses that would have to be adjusted if there are convergence difficulties. Figure 9.19 shows the JMP 8 control panel for the nonlinear estimation routine. Here we check the Loss is Neg LogLikelihood box and click Go. The JMP output after the nonlinear routine converges (several iterations may be needed for convergence) is shown in Figure 9.20. The parameter MLEs are: μ̂ = 7.014 (T50 = 1112), σ̂ = 0.35, and p̂ = 0.16. The minimum negative log-likelihood value for this model is 150.539 (called Loss by JMP). By clicking Confidence Limits, LR intervals for the parameters can be obtained, and 95% limits are shown in this output. To complete the analysis we next test whether the defect model gives a statistically significant improvement over a single population fit with p = 1. We can do this by returning to the control panel screen and changing p to equal 1 and locking it to remain at 1 for the analysis—that is equivalent to analyzing the data as if it were standard lognormal reliability data. After checking


FIGURE 9.18 JMP nonlinear analysis entry screen.

FIGURE 9.19 JMP nonlinear analysis platform control screen.


FIGURE 9.20 MLEs for Example 9.5 defect model data.

Lock, we hit the Reset button and then Go. The control panels, before and after hitting Go, are shown in Figure 9.21. The minimum negative log-likelihood value for the standard (p = 1) model is shown to be 153.592 (again labeled Loss). Calculating the LR test of the hypothesis p = 1, we have λ = 2(L1 − L2) = 2(153.592 − 150.539) = 6.106. Looking at chi-square tails (with 1 degree of freedom), we find the probability of getting a λ this large is .0135. Therefore, we reject the nondefect model analysis and use a defect model with p = .16 for the population this sample came from. Next, we look at an Excel solution for this example. Figure 9.22 shows the modified general lognormal estimation spreadsheet set up to do a lognormal defect model analysis for the Example 9.5 data. This worksheet can be downloaded as Ex9.6.xls. For reference, the formula in the Likelihood column H2 cell consists of a series of nested IF statements:

= IF(F2 = 0, G2 ⋅ (LN($C$2) + LN((1/D2) ⋅ NORMDIST(LN(D2), $A$2, $B$2, FALSE))),
  IF(F2 = 1, G2 ⋅ LN(1 − $C$2 ⋅ LOGNORMDIST(D2, $A$2, $B$2)),
  IF(F2 = 2, IF(D2 = 0, G2 ⋅ (LN($C$2) + LN(LOGNORMDIST(E2, $A$2, $B$2))),
  G2 ⋅ (LN($C$2) + LN(LOGNORMDIST(E2, $A$2, $B$2) − LOGNORMDIST(D2, $A$2, $B$2)))), 0)))


FIGURE 9.21 Before and after panels for P = 1 for the nonlinear analysis.

FIGURE 9.22 Excel spreadsheet for MLE fitting of lognormal defect model data.


and this formula is pulled down to the end of the data rows. Note the IF statements are evaluated based on the censor code in column F. If the data is exact, the first IF expression represents the log-likelihood term

r_i [ln(p) + ln(f(t_i))]

where r_i is the failure frequency, p is the population defective proportion, and f(t_i) is the PDF for the lognormal distribution with parameters mu and sigma. If the data is right censored, the second IF expression represents the log-likelihood term

r_i ln[1 − pF(t_i)]

where F(t_i) is the lognormal CDF. If the data is interval censored, the final IF expression is itself a nested IF statement and represents the log-likelihood term for two cases: if the interval start time t_i = 0, the log-likelihood term is

r_i [ln(p) + ln(F_i(t_E))]

where F_i(t_E) is the lognormal CDF at the end of interval i; otherwise, the term is

r_i {ln(p) + ln[F_i(t_E) − F_i(t_S)]}

where F_i(t_S) is the lognormal CDF at the start of interval i. Initial guesses are put in A2, B2, and C2. The converged MLEs are shown in A2, B2, and C2 in Figure 9.22, after Solver has been run to vary these cells in order to minimize the value of –Sum Lik in I2. Note that the results closely match those obtained from JMP. EXERCISE 9.6 Use the spreadsheet shown in Figure 9.22 to obtain the MLEs and negative log-likelihood sum for a nondefect model fit to the data. (Hint: Set C2 to 1 and take C2 out of the Solver "changing cells" input line). EXERCISE 9.7 Use the MLE method to estimate the parameters of the defect model for the data in Example 9.5 assuming the test ended at 1665 hours instead of 2000 hours. What does the LR test say in this case?
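For readers more comfortable with code than with spreadsheet formulas, the same nested-IF logic can be expressed as a small function. The Python sketch below (our own illustration; the argument names are ours) returns the log-likelihood contribution of one data row of the lognormal defect model; summing it over all rows and maximizing over mu, sigma, and p reproduces the Solver calculation.

import numpy as np
from scipy.stats import norm

def row_log_lik(start, end, censor, count, mu, sigma, p):
    F = lambda x: norm.cdf((np.log(x) - mu) / sigma)                 # lognormal CDF
    f = lambda x: norm.pdf((np.log(x) - mu) / sigma) / (sigma * x)   # lognormal PDF
    if censor == 0:                            # exact failure time in 'start'
        return count * (np.log(p) + np.log(f(start)))
    if censor == 1:                            # right censored at 'start'
        return count * np.log(1.0 - p * F(start))
    if censor == 2:                            # interval (readout) failure between 'start' and 'end'
        lower = 0.0 if start == 0 else F(start)
        return count * (np.log(p) + np.log(F(end) - lower))
    return 0.0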

The last defect model example in this section uses actual multistress life test data from a semiconductor company’s testing of a new component technology. The company name and the laboratories involved have been omitted for confidentiality, but the data has not been changed. EXAMPLE 9.7  MULTISTRESS DEFECT MODEL EXAMPLE The JMP data table in Figure 9.23 shows failure data observed during temperature stress cycling of a sample of components from a new technology. All the data were heavily censored readout data and three different stress levels of “effective delta temperature” were used. The data table has 60 rows with 75 failures and 1982 right censored units. There were three stress cells. The first 10 rows describe interval readouts and censoring times for components undergoing an effective delta temperature stress of 90°. The next 23 rows describe interval readouts and censoring times for components undergoing an effective delta temperature stress of 105.65°. The final 27 rows describe interval readouts and censoring times for components undergoing an effective delta temperature stress of 116.46°.


FIGURE 9.23 Defect model, multistress cell data.

Previous testing had shown a Weibull life distribution model to be effective for the mechanism and the characteristic life parameter was expected to vary with the effective delta temperature according to the acceleration model



α_{Delta Effective T} = A (Delta Effective T)^B,  that is,  ln α_{Delta Effective T} = a + B ln(Delta Effective T)

Test results on recent production, however, were showing early failures even in lower effective delta temperature cells. That led the engineers to believe they might be seeing failures from a defective subpopulation and the length of time at stress before these failures occurred might have


relatively little to do with the stress levels they were undergoing. This hypothesis is what we want to examine during the analysis. More specifically, we seek to answer the following questions:

1. Do the three cells of data behave like one cell with a defective subpopulation? (A total of three parameters to estimate [α, β, and p], or only two parameters to estimate if p = 1.)
2. Do we achieve a significantly better fit by treating them as three separate cells with different α's, one common β, and one overall defective subpopulation p parameter? (A total of five parameters to estimate, or only four if p = 1.)
3. If it appears correct to treat the cells as different but having one overall defective subpopulation p parameter, does the delta temperature acceleration model give a satisfactory fit? (A total of four parameters to estimate [A, B, β, and p], or only three if p = 1.)

Solution The JMP data table shown in Figure 9.23 has self-explanatory entries, except for the negative log-likelihood formulas in the 5th, 6th, and 7th columns (the 8th through 11th columns are indicator variables for designating the effective temperature; for reference, the likelihood formulas in the 5th, 6th, and 7th columns are shown in Appendix 9.4A. This data table can be downloaded as Ex9.7.jmp). We can use these defect model negative log-likelihood formulas for analyzing many different kinds of multicensored Weibull multicell data—all we need is a start and stop column of times and a censor column with 0 = exact time of fail, 1 = right censored time and 2 = readout interval data (and also, of course, include a frequency column). The formulas shown in Appendix 9.4A to this chapter are set up for one cell, three cells with indicator variables in columns I, J, and K, and three cells with a "power law" type of model. These formulas can be easily modified for other configurations. Figure 9.24 shows how to use the nonlinear platform to solve for p and the Weibull parameters, assuming all data comes from one cell (no stress dependencies). Running the other formula columns is similar and the LR test statistics are calculated exactly as shown in previous examples in this chapter. All the negative log-likelihood values needed to test the hypotheses suggested by the three questions asked in Example 9.7 are shown in Table 9.4. Even without calculating LR test statistics and p values, it is obvious that a subpopulation of defectives (p = .045) is clearly a more suitable model than assuming p = 1. However, the LR chi-square test statistic (2 degrees of freedom) for going from one cell with three parameters to three cells with five parameters is 2 × (409.03 − 407.75) = 2.56 and this is not even close to being significant. Going from one cell with three parameters to three cells and the delta temp model (four parameters) yields an LR chi-square test statistic (1 degree of freedom) of 2 × (409.03 − 407.82) = 2.42 which also is not significant. So, while there might be stress dependencies that follow the proposed model, for all practical purposes, treating the data as one cell with a defective subpopulation of 4.5% is a reasonable approach based on these life test results. The Excel solution for Example 9.7 is covered in Exercise 9.8.
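The LR arithmetic in the preceding paragraph is easy to reproduce. The short Python check below (an added illustration that simply reuses the negative log-likelihood values from Table 9.4) also attaches chi-square p values to the two comparisons:

from scipy.stats import chi2

one_cell, three_cells, temp_model = 409.03, 407.75, 407.82   # defect-model fits from Table 9.4

lam_cells = 2 * (one_cell - three_cells)     # 5 versus 3 parameters: 2 degrees of freedom
lam_model = 2 * (one_cell - temp_model)      # 4 versus 3 parameters: 1 degree of freedom
print(lam_cells, chi2.sf(lam_cells, df=2))   # 2.56, p is about .28 -- not significant
print(lam_model, chi2.sf(lam_model, df=1))   # 2.42, p is about .12 -- not significant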

A final word of caution: Although the defect model is very useful and applies in many applications, a slight appearance of curvature on a probability plot does not automatically indicate its presence. Random data can often give the appearance of curvature, especially in the tails, even when all the samples belong to the same population. A long period without further failures is a more reliable clue, and a physical explanation should always be sought. The defect model adds another parameter, so it will often find a better fit even if no defect subpopulation is present. Even if an LR test appears to confirm the presence of a defect subpopulation, the analyst should look for a reasonable physical explanation for the difference between susceptible and nonsusceptible units.


FIGURE 9.24 JMP MLEs for the one cell Weibull defect analysis (callouts on the output: Loss = negative log-likelihood; the estimate of p).

TABLE 9.4 Negative Log-Likelihood Values for Different Models

P         Model and Number of Parameters                       Negative Log Likelihood
1         One cell (two parameters)                            415.78
1         Three cells, one shape (four parameters)             412.55
1         Three cells, Delta Temp Model (three parameters)     414.92
0.0448    One cell (three parameters)                          409.03
0.0488    Three cells, one shape (five parameters)             407.75
0.0483    Three cells, Delta Temp Model (four parameters)      407.82

EXERCISE 9.8 Download the three spreadsheets Ex9.8with1cell.xls, Ex9.8with3cells.xls, and Ex9.8withmodel.xls. They are set up to find likelihood values and MLEs for the defect model using the data given in Example 9.7 and an assumption of one cell, three cells, or the stress model, as indicated by the spreadsheet name. When you download these spreadsheets, they will already have initial starting values for the parameters that need to be estimated. You can change these if you have better starting guesses. Use Solver to find MLEs and minimized negative log-likelihood values for the three cases. Repeat the Solver runs, this time setting the p parameter to 1 and removing it from the Solver list of parameters to vary. Check that you have reproduced all the numbers shown in Table 9.4.


EXERCISE 9.9 With minor changes and/or additions, you can use the spreadsheets downloaded in Exercise 9.8 for new sets of experimental data where you suspect a Weibull defect model applies. Describe how you would modify Ex9.8with3cells.xls if you were given experimental data with four different stress cells where it is suspected a Weibull defect model might apply. Next, assume you are told the second stress cell (effective delta temp of 105.65) was really run at the same stress as the first cell (effective delta temp of 90). Modify Ex9.8with3cells.xls accordingly, and run the analysis again. What are the MLEs for p and the two cells of data and what is the minimized negative log-likelihood value? EXERCISE 9.10 Fit a one cell lognormal defect model to the Example 9.7 data. Use JMP as described in Example 9.6 or download Ex9.8lnwith1cell.xls. How does the minimized negative log-likelihood value compare to the Weibull one cell defect model value? EXAMPLE 9.8  DEFECT MODEL DATA TREATED AS TRUNCATED DATA Another way of looking at the failure data given in Example 9.5 is to consider that 15 failures occurred up to 2000 hours and you have no information about how many defective units are left to fail beyond 2000 hours (since you do not know how many of the 85 surviving units actually belong to the defective subpopulation). However, this is the same as right-truncated data (see the definition in Section 2.10 and the likelihood discussion for truncated data in Section 4.5), so we would expect to get the same estimates for the defective subpopulation parameters if we analyzed the Example 9.5 failure data assuming truncation at 2000 hours. Show this is the case (Exercise 9.11 gives hints for a proof).

Solution Section 4.5 showed how MLEs for left-truncated data could be found by adding a right censoring time at the end of the test with a negative frequency equal to all the units on test. MLEs for right-truncated data are generated by adding a similar "fake" spreadsheet entry: add an interval from time 0 to the right truncation time with a negative frequency equal to all the units that failed on test prior to that time. The Excel spreadsheet, assuming a lognormal model for the Example 9.5 data, is shown in Figure 9.25 after running Solver. Note that the parameter estimates match those in Figure 9.22. We can also use JMP to obtain these MLEs, as shown in Figures 9.26 and 9.27.

After analyzing defective subpopulation data as if they were truncated data, can we estimate p? An estimate equal to the MLE estimate assuming a defect model can be calculated from

p̂ = (n/F̂(T))/N = n/[F̂(T) N]

where T is the truncation time, n is the number of observed failures prior to time T, and N is the total number under observation (including defectives and nondefectives). For the Example 9.5 data, T is 2000, n is 15, and N is 100. Then, Fˆ(T ), using the parameter MLEs shown in Figure 9.25, is 0.9517 and p = 15/95.17 = .1576. This agrees with the previously shown estimate (Figure 9.22). EXERCISE 9.11 Prove that the MLE estimates for the population parameters for the defect model are the same as obtained with a right truncated data analysis approach. Hint: Use an arbitrary two parameter life distribution model F(t; a, b) and write out the two likelihood


FIGURE 9.25 Excel MLE fitting of defect model data inputted as truncated data.

FIGURE 9.26 JMP table showing defect model data inputted as truncated data.


FIGURE 9.27 JMP fitting of defect model data inputted as truncated data.

equations. Take logarithms and set partial derivatives with respect to the parameters equal to 0 and show the life distribution parameter MLE solutions for both cases are the same. NOTE: You will also find that the MLE for p is p̂ = n/[F̂(T) N], where n is the number of

observed failures prior to time T and N is the total number under observation.

9.6 Summary This chapter described several useful models for analyzing reliability data. Step stress models and degradation models extend the scope of the life distribution and acceleration models that were introduced in earlier chapters to cover test situations where stress levels change during a test or where few failures occur but parametric degradation readings are recorded. Lifetime regression modeling allows us to estimate effects and test the significance of different factors that may affect reliability. Fitting defect subpopulation models (and testing whether they offer a significant improvement over the p = 1 model) offers a powerful analysis approach that is likely to have many applications.

Appendix 9A 9.1A JMP Solution for Step Stress Data in Example 9.1 EXAMPLE 9.1A  JMP’S NONLINEAR MODELING PLATFORM Redo Example 9.1 using JMP’s nonlinear modeling platform.

Solution We need to set up a JMP data table that calculates lognormal likelihood values for readout data and also adjusts interval endpoints as a function of ΔH. This has been done in the (downloadable) data table Example9.1.jmp (shown in Figure 9.1A). For clarity, we have set up the columns to


FIGURE 9.1A JMP data table for analysis of the Example 9.1 Arrhenius step stress data.

show the actual readout intervals and the equivalent interval times associated with the step stress temperatures referenced to 85°C. All the calculations are done by the formula in the Negative Log Likelihood column. This formula has default (starting value) parameter inputs for mu, sigma, and ΔH (called DH in the formula) of 9, 1, and 1, respectively. Those values produce the Adjusted columns shown in Figure 9.1A. The JMP formula entered in the Negative Log Likelihood column is shown in Figure 9.2A. Go from Analyze to Modeling to Non-Linear and put the Negative Log-Likelihood column in as Loss. The screen should look like Figure 9.3A. One may have to click OK through any error alerts. After clicking Go, the MLEs shown in the Solution section of Figure 9.4A are obtained. The minimized negative log likelihood is called Loss in this screen. NOTE: This will only work with JMP release 6 or later.

9.2A  Lifetime Regression Solution Using Excel We will analyze the plant/process data of Example 9.4 using Excel. We start with the general Weibull estimation spreadsheet described in Chapter 4 (a lognormal analysis would start with the general lognormal estimation spreadsheet described in Chapter 5). These can be downloaded under the names Genweibest.xls for the Weibull or Genlnest.xls for the lognormal. Add seven new columns right after column B and give them the headings shown in Figure 9.5A. Also note the change to the heading in column A. Columns F, G, H, and I are indicator columns that have 1s for rows with data from the appropriate plant or process and 0s otherwise. Columns B, C, D, and E are the parameter columns, with initial guesses for these quantities placed in the second row. Definitions for intercept and the plant and process factors follow JMP's version of the lifetime regression model. The exact failure times and right censoring times are entered in column J. The censoring codes and counts that correspond to the entries in column J go in columns L and M, respectively. The first 35 rows of the resulting spreadsheet are shown in Figure 9.5A. This spreadsheet, Ex9.5.xls, can be downloaded. A new formula is needed in cell A2. For the full model, with both plant and process effects, A2 has the formula

= $C$2 + $D$2 ⋅ F2 − $D$2 ⋅ G2 + $E$2 ⋅ H2 − $E$2 ⋅ I2

and this is pulled down to the end of the A column. Only one other change is needed before running Solver. Every place in the log-likelihood formula, now in N2, where $A$2 appears—change it to A2. Then pull down N2 to


FIGURE 9.2A JMP formula for the Negative Log Likelihood column used in the step stress analysis.


FIGURE 9.3A JMP nonlinear platform entry screen for the step stress analysis.

FIGURE 9.4A JMP nonlinear platform solution (MLEs) for the step stress analysis.


FIGURE 9.5A Excel spreadsheet for the plant/process field reliability data.

the end of the N column and the setup work is done. For reference, the formula in N2 should now be

= IF(L2 = 0, M2 ⋅ (LN($B$2) − LN(K2) + $B$2 ⋅ (LN(K2) − A2) − ((K2/EXP(A2))^$B$2)),
  IF(L2 = 1, M2 ⋅ LN(1 − WEIBULL(J2, $B$2, EXP(A2), TRUE)),
  IF(L2 = 2, IF(J2 = 0,
  M2 ⋅ LN(WEIBULL(K2, $B$2, EXP(A2), TRUE)),
  M2 ⋅ (LN(WEIBULL(K2, $B$2, EXP(A2), TRUE) − WEIBULL(J2, $B$2, EXP(A2), TRUE)))), 0)))

After putting starting guesses in B2, C2, D2, and E2 and running Solver (see Figure 9.6A), you will get the results shown in Figure 9.5A (or you may need to pick new starting values). Note that all the parameter estimates agree closely with JMP and the minimized negative log likelihood agrees with Minitab.


FIGURE 9.6A Solver screen for the plant/process field reliability data.

FIGURE 9.7A JMP negative log-likelihood column formula for the lognormal defect model.


EXERCISE 9.1A Set D2 to 0 and run Solver again, this time leaving $D$2 out of the list of cells to change. That will produce a model fit where the plant effect is assumed zero. Use the minimized negative log-likelihood value in O2 to construct a LR test for the significance of the plant effect. Repeat the analysis, this time setting E2 to 0 and leaving $E$2 out of Solver (but not $D$2). You should get the same results for p values as obtained by JMP in Figure 9.11.

9.3A  JMP Likelihood Formula for the Defect Model The formula for the negative log-likelihood column for the JMP table shown in Figure 9.17 is shown in Figure 9.7A. If we substitute 1 for the parameter p, this formula reduces to the standard negative log likelihood for lognormal reliability data.

9.4A JMP Likelihood Formulas for Example 9.7 Multistress Defect Model Example

FIGURE 9.8A Formula for Weibull, one cell, defect model.


FIGURE 9.9A Formula for Weibull, three cells, defect model.

FIGURE 9.10A Formula for Weibull, effective delta temperature acceleration, defect model.


Problems

9.1 A single cell step stress experiment was run for a component family believed to follow an Arrhenius/power model for temperature/voltage acceleration and a Weibull life distribution. A total of 150 test units underwent voltage and temperature stresses as described in Table 9.5 and the failures during each readout interval were recorded. Use the data in the table to estimate ΔH, the voltage exponent B, and the Weibull shape parameter β. Finally, project a use CDF at 100,000 hours.

9.2 100 components are put on stress test. Readouts occur at 25, 50, 100, 250, and 500 hours. The numbers of new failures discovered at the respective readouts are: 15, 9, 12, 5, and 1. Plot the data on lognormal probability paper. Does the fit appear reasonable? Repeat the plotting assuming a defect model applies and trying different values for the fraction susceptible to failure. Does the fit improve? Use an iterated least squares graphical procedure to estimate T50 and sigma for the defect subpopulation and the fraction defective. Next, estimate the parameters using the method of maximum likelihood and test whether the assumption of a defect subpopulation is warranted.

9.3 50 components are put on stress test. Readouts are taken at 10, 25, 50, 100, 200, 500, and 1000 hours. The numbers of new failures discovered at the respective readouts are 2, 2, 4, 5, 4, 3, and 0. Repeat the plotting and estimation procedures described in Problem 9.2, this time assuming a Weibull model for the defect subpopulation.

9.4 Write the maximum likelihood equation for Type I censored data, assuming a defect model applies and the exponential distribution models the defective subpopulation adequately. Derive the equations that must be solved to obtain MLEs (by taking partial derivatives of the log likelihood).

TABLE 9.5 Step Stress Data for Problem 9.1

Readout Start   Readout End   Voltage (V)   Temperature (°C)   Fails during Interval
0               24            6             125                0
24              72            6             125                1
72              168           6             125                10
168             500           6             125                21
500             1000          6             125                38
1000            1200          6             125                15
1200            1400          9             125                16
1400            1600          9             125                18
1600            1800          12            90                 1
1800            2000          12            90                 1
2000            2100          12            110                5
2100            2200          12            110                0
2200            2300          12            125                9
2300            2400          12            125                4
2400            2500          12            125                4

Note: Remaining unfailed at end of test: seven components.


10 System Failure Modeling: Bottom-Up Approach The last two chapters have described how to estimate reliability distribution parameters and failure probabilities for components operating at typical use conditions. How can these probabilities be used to predict total system performance? How does the design of the system affect reliability? What are the benefits of redundant design? This chapter will answer questions like these and also develop general reliability algorithms that can be used to project system reliability and evaluate the failure rate improvement expected by adding a “burn-in” step to the manufacturing process. Note, however, that this chapter only considers the first failure reached by a system—or the failure rate (of first system failures) for a population of similar systems. Repair rates for repairable systems will be discussed in Chapters 12 and 13.

10.1  Series System Models The most commonly used model for system reliability assumes that the system is made up of n independent components, which all must operate in order for the system to function properly. The system fails when any component fails. This model is called a series, first fail, or chain model system. Although the independence assumption or the first fail assumption may not be strictly valid for an actual system, this model is often a reasonable and convenient approximation to reality. In Section 2.2, we derived the formula for the reliability of a series system composed of n identical independent elements, using the multiplication rule for probabilities. Now we generalize that case to a series system of n, possibly all different, components. Let the ith component have the reliability function Ri (t). Then, the probability the system survives to time t, denoted by the system reliability function RS (t), is the probability that all the components simultaneously survive to time t. Under the independence assumption, this probability is the product of the individual probabilities of survival (multiplication rule). These probabilities are just the Ri (t). The expression is n



R_S(t) = \prod_{i=1}^{n} R_i(t) = R_1(t) \times R_2(t) \times \cdots \times R_n(t)

or, in terms of the cumulative distribution function (CDF) functions,

F_S(t) = 1 - \prod_{i=1}^{n} (1 - F_i(t)) = 1 - \prod_{i=1}^{n} R_i(t)


For system failure rates, the relationship is even simpler:

h_S(t) = \sum_{i=1}^{n} h_i(t)

AFR_S(T_1, T_2) = \sum_{i=1}^{n} AFR_i(T_1, T_2)

These equations show that for a series system, the failure rate can be calculated by summing up the failure rates of all the individual components. There are no restrictions on the types of distributions involved, and the result is exact, not an approximation. The only requirements are the independence and the first fail assumptions. The proof of this convenient formula is very easy and is outlined in Example 10.1. EXAMPLE 10.1 SERIES SYSTEMS Derive the additivity relationship for series system failure rates by using the fact that the failure rate function can be defined as the negative derivative of the natural logarithm of the reliability function (see Section 2.4 and Problem 2.1).

Solution We have

-\ln R_S(t) = -\ln \prod_{i=1}^{n} R_i(t) = \sum_{i=1}^{n} -\ln R_i(t)

Therefore, the failure rate is

\frac{d}{dt}\left(-\ln R_S(t)\right) = \sum_{i=1}^{n} -\frac{d}{dt}\left(\ln R_i(t)\right)

The additivity of failure rates follows immediately from the last equation.

10.2  The Competing Risk Model (Independent Case) The series model formulas apply in another important case. A single component with several independent failure modes is analogous to a system with several independent components. The failure mechanisms are competing with each other in the sense that the first to reach a failure state causes the component to fail. The series system probability arguments again apply, and the reliability of the component is the product of the reliability functions for all the failure modes. Failure rates are additive, mechanism by mechanism, to get the failure rate of the component. The more general competing risk model where the failure processes for different mechanisms are not independent can be very complicated since one must know how the random times of failure for different mechanisms are correlated. This general model will not be treated in this text.


EXAMPLE 10.2 BOTTOM-UP CALCULATIONS A home computer has most of its electronics on one board. This board has 16 memory modules, 12 assorted discrete components, and a microprocessor. The memory modules are specified to have an exponential failure rate with λ = 0.01%/K. The discrete components each have a Weibull CDF with β = 0.85 and α = 3,250,000 hours. The microprocessor is thought to have two significant failure mechanisms. Each mechanism was modeled based on accelerated testing designed to cause that type of failure. The results, adjusted to normal use conditions, yielded a lognormal distribution with σ = 1.4 and T50 = 300,000 for one mode and an exponential with λ = 0.08%/K for the other mode. Assuming that all components and mechanisms operate independently (at least until the first fail), what is the board failure rate at 5000 hours? What is the chance a board has no failures in 40,000 hours?

Solution First, we derive the microprocessor failure rate by adding the lognormal and exponential competing failure mode failure rates, evaluated at 5000 hours. This sum is 1600 PPM/K. The reliability at 40,000 hours is the product of the exponential and the lognormal reliability functions. The result is 0.896. For this part of the example, the competing risk model was used. Using the series model, we add the failure rates of the 16 memory modules together and obtain 1600 PPM/K. The product of their exponential reliability functions at 40,000 hours is 0.938. The 12 discrete components each have a Weibull failure rate of 700 PPM/K at 5000 hours, adding 8400 PPM/K to the board total. The product of the 12 reliability functions, evaluated at 40,000 hours, is 0.7515. The sum of all the failure rates is 11,600 PPM/K, or 1.16%/K, for the board total failure rate at 5000 hours. The probability the board lasts 40,000 hours without a failure is the board reliability, or the product of the component reliabilities: 0.896 × 0.938 × 0.7515 = 0.632.
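As a supplement to the hand calculation above (an added Python/SciPy sketch, not from the book), the bottom-up arithmetic can be redone in a few lines: sum the component failure rates at 5,000 hours and multiply the component reliabilities at 40,000 hours.

import numpy as np
from scipy.stats import norm

t1, t2 = 5000.0, 40000.0

# 16 memory modules: exponential, lambda = 0.01%/K = 1e-7 per hour each
lam_mem = 1e-7
h_mem, R_mem = 16 * lam_mem, np.exp(-16 * lam_mem * t2)

# 12 discrete components: Weibull, beta = 0.85, alpha = 3,250,000 hours
beta, alpha = 0.85, 3.25e6
h_disc = 12 * (beta / alpha) * (t1 / alpha) ** (beta - 1)
R_disc = np.exp(-12 * (t2 / alpha) ** beta)

# microprocessor: lognormal (sigma = 1.4, T50 = 300,000) plus exponential (0.08%/K) competing modes
sig, T50, lam_exp = 1.4, 3.0e5, 8e-7
z = lambda t: np.log(t / T50) / sig
h_micro = norm.pdf(z(t1)) / (sig * t1) / norm.sf(z(t1)) + lam_exp
R_micro = norm.sf(z(t2)) * np.exp(-lam_exp * t2)

print("board failure rate at 5000 h (%/K):", 100 * 1000 * (h_mem + h_disc + h_micro))  # roughly 1.1-1.2%/K
print("board reliability at 40,000 h:", R_mem * R_disc * R_micro)                      # about 0.63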

This last example shows how, starting with individual failure mode models and using the competing risk and series model as building blocks, a bottom-up calculation of subassembly or system failure rates is done. Since testing on the system level is usually limited due to time and cost constraints, this bottom-up approach is of great practical value to develop projected failure rates that can be used as targets. EXERCISE 10.1 Assume that the shape parameter for the 12 discrete components in Example 10.2 is 1.0 instead of 0.85. What is the board failure rate at 5000 hours? What is the reliability at 40,000 hours?

The competing risk model assumed that the failure mechanism CDFs were known in the preceding example. How do we analyze life test data when there are multiple failure modes and we need to estimate separate CDFs for them? Assume that we have run a life test of a component and are in the middle of plotting the failure data and running analysis programs when we are told by the failure analysis engineers that they can divide the failures into three very distinct modes. Moreover, based on the nature of the failure mechanisms, we might want to use different life distributions to model these modes. How can we statistically separate the different failure distributions and estimate their parameters? This subject was mentioned briefly in Section 2.11. Assuming independence of failure modes (i.e., they are not "looking over their shoulder at each other" and changing their probability of failure depending on how the other is progressing) and a first fail model, we can treat the data mode by mode as multicensored data. When we are analyzing mode 1, all mode 2 and 3 failures are "censored units" taken off test at the failure time. Multicensored data can be plotted using either hazard plotting procedures or the Kaplan–Meier product probability method (see Chapters 6 and 7). Good maximum likelihood estimation


programs will be able to analytically handle multicensored data and estimate distribution parameters. The key point is to go through the analysis one mode at a time, treating all other modes as units taken off test. The ability to separate data points by failure mode is critical to the analysis. This separation should be supported with physical analysis; trying statistically to separate failure modes is difficult, especially if the distributions have considerable overlap. Occasionally, a component with several known failure mechanisms has zero failures on life test. Even if we are willing to assume that each mode has an exponential life distribution, is there anything we can do with zero failures? Say there are three failure modes with unknown failure rates λ1, λ2, and λ3. Then, the component has the constant failure rate of λ = λ1 + λ2 + λ3, and we can use the zero failures formulas of Chapter 3 to put an upper bound on λ. Unfortunately, this same upper bound applies individually to each of the three mechanism failure rates. So, we have the paradoxical sounding result that the upper bound on the total is one-third the sum of the upper bounds of the parts. If we have known acceleration factors A1, A2, and A3, for the three modes, the best we can do is use the minimum of these to derive a use condition upper bound from the component test upper bound.
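As a concrete illustration of the zero-failures situation (a hypothetical 100-unit, 2000-hour test, not data from the book), the standard exponential zero-failures bound of -ln(0.05) divided by the total unit-hours gives the same 95% upper limit for the total failure rate and for each individual mode:

import math

unit_hours = 100 * 2000.0                      # hypothetical: 100 units tested 2000 hours, zero failures
upper_95 = -math.log(0.05) / unit_hours        # 95% upper bound on lambda = lambda1 + lambda2 + lambda3
print(round(100 * 1000 * upper_95, 2), "%/K")  # about 1.5%/K; the same bound applies to each mode,
                                               # so the bound on the total is one-third the sum of the per-mode bounds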

10.3  Parallel or Redundant System Models A system that operates until the last of its components fail is called a “parallel or redundant system.” This system model is the extreme opposite from the series model where all components must work. Parallel systems offer large advantages in reliability, especially in early life. In applications where good reliability and low front-end failure rates have higher priority than component cost, designing in redundancy, at least for key parts of the system, is an often used option. The computer systems on the space shuttle are an example of this concept: every system is replicated with several backup copies—even including independent versions of key software. As before, let the ith system component have CDF Fi(t). The probability the system fails by time t is the probability that all the components have failed by time t. This probability is the product of the CDFs, or, n

F_S(t) = \prod_{i=1}^{n} F_i(t)

R_S(t) = 1 - \prod_{i=1}^{n} F_i(t) = 1 - \prod_{i=1}^{n} (1 - R_i(t))

We are assuming that all components are active. Failure rates are no longer additive (in fact, the system failure rate is smaller than the smallest component failure rate) but must be calculated using basic definitions. EXAMPLE 10.3 REDUNDANCY IMPROVEMENT A component has CDF F(t) and failure rate h(t). The impact of this failure rate makes a significant adder to a system currently under design. In order to improve reliability, it is proposed to use two of these components in a parallel (active redundancy) configuration. Show that the improvement can be expressed as a factor k given by

k = \frac{1 + F(t)}{2F(t)}

where the old failure rate is k times the new failure rate (or the new failure rate is 1/k times the old failure rate). How much improvement results when F(t) = 0.01 compared to later in life when F(t) = 0.1 or F(t) = 0.5?

Solution The CDF of the two components in parallel is F²(t) and the probability density function (PDF), by differentiating, is 2F(t)f(t). The failure rate of the pair is (leaving out the time variable t for simplicity)



h_s = \frac{2Ff}{1 - F^2} = \frac{2Ff}{(1 + F)(1 - F)} = \frac{2F}{1 + F}\, h

This result shows that h = k × hs, with k as given above. When F = 0.01, k = 50.5, or improvement of about 50 times. When F is 0.1, k is only 5.5. For F = 0.5, the failure rate improvement drops to 1.5 times. Thus, redundancy makes a large difference early in life when F is small, and much less of a difference later on. The rule of thumb is that one gains by a factor of about (1/2F). This example can easily be generalized as follows: if a single component with CDF F is replaced by n active components in parallel, then the failure rate is improved by the factor



k = \frac{1 + F + F^2 + \cdots + F^{n−1}}{nF^{n−1}}

There is about a 1/(nF^{n−1}) times improvement in early life (when F and higher powers of F are very small).

EXAMPLE 10.4  MAXIMIZING RELIABILITY USING REDUNDANCY
Consider the following situation: We have a stockpile of n different components that are used to construct a system. Each system uses one of each type of all n components in series. We have a choice of constructing two complete systems operating in parallel or one system in which each component is replicated to provide n pairs of components operating in parallel. Does replication at the component level provide more reliability than replication at the system level?

Solution
Let us consider an example involving n = 3 components (A, B, and C) with respective reliabilities of R_A, R_B, and R_C. If we construct a system in which each component must operate for the system to function, the series formula shows that the system reliability is R_A R_B R_C. For two identical systems in parallel, the probability that one system is working is

R_{2 systems} = 1 − (1 − R_A R_B R_C)(1 − R_A R_B R_C)

Consider now a single system in which each of the three components is replicated as parallel pairs. The probability that the system is functioning is

R_{comp pairs} = [1 − (1 − R_A)(1 − R_A)][1 − (1 − R_B)(1 − R_B)][1 − (1 − R_C)(1 − R_C)]

As an illustration, let us now assume that the component reliabilities are R_A = R_B = R_C = 0.90. Replication at the system level has an overall reliability of

R_{2 systems} = 1 − (1 − 0.9^3)(1 − 0.9^3) = 0.927


Replication at the component level has an overall reliability of



R_{comp pairs} = [1 − (1 − 0.9)^2]^3 = 0.970

We see that replicating components leads to a higher reliability than replicating systems. This result, true for any value of R, is rigorously proved as Theorem 9.1 in Ross (2003).

EXERCISE 10.2
The microprocessor in Example 10.2 contributed more to the board failure rate than any other single component. What would the new failure rate at 5000 hours and the new board reliability at 40,000 hours be if a second (redundant) microprocessor is added to the board?
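The comparison in Example 10.4, along with the improvement factor of Example 10.3, is easy to check numerically. The Python sketch below is purely illustrative and is not how the text performs the calculation.

```python
# Sketch: replication at the system level versus the component level
# (reliabilities from Example 10.4), plus the k factor of Example 10.3.
def series(rels):
    out = 1.0
    for r in rels:
        out *= r
    return out

def parallel(rels):
    out = 1.0
    for r in rels:
        out *= (1.0 - r)
    return 1.0 - out

R = [0.90, 0.90, 0.90]
print(parallel([series(R), series(R)]))          # two systems in parallel: ~0.927
print(series([parallel([r, r]) for r in R]))     # paired components: ~0.970

for F in (0.01, 0.1, 0.5):
    print(F, (1 + F) / (2 * F))                  # k = 50.5, 5.5, 1.5
```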

Note that the parallel model assumes that the redundant components are operating (i.e., active) all the time, even when there have been no failures. An alternative setup would be to have the redundant components in a backup or standby mode, only being called on to operate when needed. This model will be discussed in Section 10.4.

10.4  Standby Models and the Gamma Distribution We treat here only the simple case where one or more identical units are on hand to be used only as necessary to replace failed units. No allowance will be made for the failure rate of a switching device. In practice, this element would have to be added into the overall system calculation. The lifetime until a system failure is the sum of all the lifetimes of the original and standby components as each component operates sequentially until failure. For n components in the original plus standby group, the system lifetime is

T_s = T_1 + T_2 + \cdots + T_n

where the T_i are independent random times, each with CDF F(t). For n = 2, the CDF for T_s can be derived using the convolution formula for the distribution of the sum of two independent random variables. For this application, the convolution is

F_2(t) = \int_0^t F(u) f(t − u)\,du

If we now add a third component lifetime and do another convolution, we derive F_3, and so on until for F_n we have

F_n(t) = \int_0^t F_{n−1}(u) f(t − u)\,du

For complicated life distributions, such as the Weibull or the lognormal, the convolution integrals would have to be evaluated numerically. In the exponential case, however, the calculations are much simpler.


EXAMPLE 10.5 STANDBY MODEL A subassembly has a high exponential failure rate of λ = 2%/K. As an insurance backup, a second subassembly is kept in a standby mode. How much does this reduce the failure rate when the subassembly CDF is 0.01? What about when the CDF is 0.1 or 0.5?

Solution
Substituting the exponential CDF and PDF into the convolution formula gives

F_s(t) = \int_0^t \lambda(1 − e^{−\lambda u}) e^{−\lambda(t−u)}\,du = 1 − \lambda t e^{−\lambda t} − e^{−\lambda t}

The standby model PDF is the derivative of this formula, or



f_s(t) = \lambda^2 t e^{−\lambda t}

Using 0.00002 for λ, the time when F(t) is 0.01 corresponds to 502.5 hours. F(t) is 0.1 and 0.5 at times 5,268 and 34,657 hours, respectively. Using these times and λ value, we find that the improvement over 2%/K, by calculating hs from fs/(1 − Fs), is about 100 times when the CDF is 0.01, and 10 times when the CDF is 0.1. The improvement factor is only about 2.4 times when the CDF is 0.5. At the early times, the failure rate improvement given by the standby model with two exponential components was twice that given by the parallel model (see Example 10.3).
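The improvement factors quoted above can be reproduced with a few lines of Python (a sketch, not the text's spreadsheet approach), using the closed-form F_s(t) and f_s(t) just derived.

```python
# Sketch: standby-model failure rate improvement for Example 10.5.
import math

lam = 0.00002                                   # 2%/K hour, expressed per hour

def standby_hazard(t):
    Fs = 1 - math.exp(-lam * t) * (1 + lam * t)
    fs = lam**2 * t * math.exp(-lam * t)
    return fs / (1 - Fs)

for F in (0.01, 0.1, 0.5):
    t = -math.log(1 - F) / lam                  # time at which a single unit's CDF = F
    print(F, round(lam / standby_hazard(t), 1)) # roughly 100, 10, and 2.4
```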

We can generalize Example 10.5 to the case of an r-level standby system of exponential components. The system lifetime PDF can be derived, by repeated convolutions, to be

f_s(t) = \frac{\lambda^r t^{r−1} e^{−\lambda t}}{(r − 1)!}

This expression is the PDF of the gamma distribution. The parameters are λ and r. The PDF has values only for nonnegative t, and λ and r must be positive numbers. The mean time to failure (MTTF) is r/λ, and the variance is r/λ2. In our derivation, r can have only integer values (the number of identical exponential components in the standby system). The general gamma distribution allows r to take on noninteger values, and uses the gamma function (described in Chapter 4) to replace factorials. In this form of the PDF

f_s(t) = \frac{\lambda^r t^{r−1} e^{−\lambda t}}{\Gamma(r)}

This is a very flexible distribution form, and it is often used empirically as a suitable life distribution model, apart from its derivation as the distribution of a sum of exponential lifetimes. If we have an r standby exponential model, leading to the above gamma distribution, it can be shown using approximations described in Gnedenko, Belyayev, and Solovyev (1969) that the improvement factor is approximately r! times greater in early life than the improvement obtained from the parallel system model. For r = 2, this gives a factor-of-two improvement, as seen in Example 10.5.

EXERCISE 10.3
For the subassembly in Example 10.5, compute the approximate failure rate when the CDF is 0.01 for the case where two standby subassemblies are added to back up


the original subassembly. (Hint: Calculate the same failure rate if the two additional standby units are in a parallel redundant mode and reduce this number by the improvement factor for standby vs. redundant models.) Another special case of the gamma is of interest. When λ = 0.5 and r is an integer, by substituting r = d/2, the PDF becomes



f(t) = \frac{(1/2)^{d/2}\, t^{(d−2)/2}\, e^{−t/2}}{\Gamma(d/2)}

This is the chi-square distribution with d degrees of freedom, which was used to obtain exponential confidence bounds in Chapter 3.
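A quick numerical check (not from the text) that the standby PDF above is the gamma distribution, and that the λ = 1/2, r = d/2 case matches the chi-square. The evaluation point is arbitrary, and scipy's parameterization (shape a = r, scale = 1/λ) is assumed.

```python
# Sketch: gamma PDF from the standby derivation and its chi-square special case.
from scipy.stats import gamma, chi2

lam, r = 0.5, 3.0           # r-level standby with lambda = 1/2 (so d = 2r = 6)
t = 4.2                     # arbitrary check point
print(gamma.pdf(t, a=r, scale=1.0 / lam))
print(chi2.pdf(t, df=2 * r))   # same value
```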

10.5  Complex Systems

Models for systems that continue to operate as long as certain combinations of components are operating can be developed with great generality (see Barlow and Proschan 1975). Here we discuss two types of complex system: k-out-of-n systems, which operate as long as at least k components (any k) out of n identical components are working, and those that can be diagrammed as combinations of series and parallel (not necessarily identical) components. The formula for the reliability function when at least k out of n components must work is obtained by summing the probabilities “exactly k,” “exactly k + 1,” and “exactly k + 2,” and so on all the way up to “exactly n” work. These exact cases are all disjoint events, and the sum is the probability at least k out of n are working. Each of these probabilities can be evaluated using the binomial formula (see Chapter 11). The result is

R_S(t) = \sum_{i=k}^{n} \binom{n}{i} R^i(t)[1 − R(t)]^{n−i}

where

\binom{n}{i} = \frac{n!}{i!(n − i)!}

and

n! = n \times (n − 1) \times (n − 2) \times \cdots \times 1

EXERCISE 10.4 For a certain type of airplane to fly, at least two out of its three engines must function. The engine reliability at time T hours is 0.995. Find the probability the airplane flies successfully for T hours assuming identical and independent engines. EXERCISE 10.5 Repeat Exercise 10.4 for engine reliability at T hours of 0.99, and again for engine reliability of 0.95. Note how the probability of airplane failure increases by orders of magnitude.
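The binomial sum above is simple to evaluate directly; this Python sketch (not from the text) uses made-up values of k, n, and R.

```python
# Sketch: reliability of a k-out-of-n system of identical, independent components.
from math import comb

def k_out_of_n_reliability(k, n, R):
    return sum(comb(n, i) * R**i * (1 - R)**(n - i) for i in range(k, n + 1))

print(k_out_of_n_reliability(2, 3, 0.90))   # e.g., a 2-out-of-3 block with R = 0.90
```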


EXAMPLE 10.6  EXPECTED LIFETIME OF K-OUT-OF-N SYSTEM OF INDEPENDENT EXPONENTIALLY DISTRIBUTED COMPONENTS
For a k-out-of-n system to be functioning, k components must be operational, which means that n − k components can fail before the system is no longer functional. When the (n − k + 1)st component fails, the system fails. The lifetime of a k-out-of-n system can, thus, be written as T_1 + T_2 + \cdots + T_{n−k+1}, where T_i is the time between the (i − 1)st and ith failures. We see that T_1 + T_2 + \cdots + T_{n−k+1} is the time when the number of working components first becomes less than k. When all components are working, the system failure rate is n/θ, where θ is the MTTF for the exponential distribution. Consequently, the time to first failure T_1 is exponentially distributed, with mean θ/n. Similarly, since the time T_i represents the time until the next failure when n − (i − 1) components are working, T_i is exponentially distributed with mean θ/(n − i + 1). It follows that the expected mean lifetime for a k-out-of-n system is

E[T_1 + T_2 + \cdots + T_{n−k+1}] = \theta\left(\frac{1}{n} + \frac{1}{n−1} + \cdots + \frac{1}{k}\right) = \theta\sum_{i=k}^{n} \frac{1}{i}



Note that the memoryless property of the exponential distribution assures us that the Ti, i = 1, … , (n − k + 1) are independent random variables. A rigorous proof of this result can be found in Ross (2003). EXERCISE 10.6 A space satellite is exposed to severe radiation while in orbit. For a specific satellite, the engineers designed the system to operate if at least two of the four radiation-hardened components were functional. The lifetimes of these independent components are exponentially distributed with MTTF = 50,000 hours. What is the expected lifetime of this satellite system?
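The expected-lifetime formula lends itself to a one-line computation; the sketch below (not from the text) uses hypothetical values of k, n, and θ.

```python
# Sketch: E[lifetime] = theta * sum(1/i, i = k..n) for a k-out-of-n exponential system.
def expected_lifetime(k, n, theta):
    return theta * sum(1.0 / i for i in range(k, n + 1))

print(expected_lifetime(3, 5, 10_000.0))    # hypothetical: 3-out-of-5, MTTF = 10,000 hours
```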

The CDF and failure rate for this model are derived from the reliability function using basic definitions. Many systems can be broken down into combinations of components or subassemblies that are in parallel configurations, and combinations that are in series. These systems can be diagrammed like an electric circuit, with blocks logically “in parallel” and blocks logically “in series.” The system “working” means that there is a path for electricity to flow from one end of the diagram to the other. The system may or may not actually have electronic parts—or it may have a combination of electronic and mechanical components. The electric circuit diagram is used only as a convenient device that helps us reduce the system, by successive steps, to simpler systems with equivalent failure rates. The three steps involved in this method are as follows:





1. Diagram the system as if it were an electric circuit with parallel and series components and groups of components. Display each component as a circle and write Fi within the circles of all the components that have the CDF function. 2. Successively reduce combinations of components by replacing, for example, a group of components that are in series by one equivalent component. This step will produce a large circle that has 1 minus the product of the Rs from each circle it replaced as its CDF. For components in parallel, the equivalent component has a CDF calculated using the parallel model formula (the product of the individual circle CDFs). 3. Continue in this fashion until the entire system is reduced to one equivalent single component whose CDF is the same as that of the entire original system.

This procedure sounds complicated and arbitrary. Actually, it turns out to be fairly automatic after a little practice. A few examples will illustrate how it works.


FIGURE 10.1 Five-component system diagram.

FIGURE 10.2 Reduced five-component system diagram.

FS = 1 − (1 − F1)(1 − F2)(1 − F3F4)(1 − F5)

FIGURE 10.3 Fully reduced five-component system diagram.

EXAMPLE 10.7  COMPLEX SYSTEM REDUCTION (FIVE COMPONENTS) A system has five different parts. Three of them must work for the system to function. If at least one of the remaining two components is working, along with the first three, the system will function. What is the system reliability function?

Solution The analogous circuit is drawn in Figure 10.1. The simplest way to proceed is to replace the parallel block containing F3 and F4 by one component of equivalent reliability (Figure 10.2). Now all the components are in series, and a straightforward application of the series model provides the answer (Figure 10.3). EXAMPLE 10.8  COMPLEX SYSTEM REDUCTION (SIX COMPONENTS) A system of six components can be broken up logically into three subassemblies. The first has three components, two of which are the same, and as long as any one of the three works, the subassembly will work. The second subassembly has two identical components, either of which must work for this part of the system to function. The last (logical) subassembly consists of one critical part. The system works only as long as each subassembly functions. What is the system reliability?

Solution The diagram for the system (labeled 1), followed by the three successive reduction steps solving for the system CDF, are shown in Figure 10.4. The hardest part of this procedure turns out to be making the initial diagram of the system. Not all systems can be broken down this way, even as an approximation. A simple generalization, left to the reader, would be to allow blocks of “r out of n” configured components. The single equivalent component replacing such a block would have a reliability function given by the binomial formula shown at the start of this section.
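The reduction procedure of Examples 10.7 and 10.8 can be mirrored with two small helper functions. This Python sketch is illustrative only (the CDF values are invented) and is not part of the text.

```python
# Sketch: series/parallel reduction on component CDF values at a fixed time t.
def series_cdf(*F):          # all blocks must work
    surv = 1.0
    for Fi in F:
        surv *= (1.0 - Fi)
    return 1.0 - surv

def parallel_cdf(*F):        # any one block suffices
    prod = 1.0
    for Fi in F:
        prod *= Fi
    return prod

# Example 10.7 structure: F1, F2, (F3 parallel F4), F5 in series
F1, F2, F3, F4, F5 = 0.02, 0.05, 0.10, 0.10, 0.03     # made-up values
print(series_cdf(F1, F2, parallel_cdf(F3, F4), F5))
```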


F_S = 1 − (1 − F_1^2 F_2)(1 − F_3^2)(1 − F_4)

FIGURE 10.4 Six-component system diagrams.

FIGURE 10.5 Backup components.

FIGURE 10.6 Backup components with switch.

EXERCISE 10.7 A system has 12 components. Components 1 through 8 are different and have CDFs of F1, F2, … , F8. Components 9 through 12 are the same, with CDF F. Components 4, 5, and 6 are critical, and each must operate for the system to function. However, only one of components 1, 2, and 3 has to be working, and the same goes for components 7 and 8. At least two of the four identical components must work, as well. Diagram the system and write the probability the system survives.

Sometimes, the diagram methods given in this section can be used in combination with simple probability arguments to solve complex system models. For example, Figure 10.5 shows a simple parallel or redundant situation where the components A and B are backed up by C and D. Say A is a logic processor for a computer and B is a memory area with critical data, and C and D are redundant backups for A and B, respectively. Obviously, this configuration is much more reliable than just having A and B. However, it is also flawed as a design, because if B and C fail we still have a processor and the critical data, but they are not linked together. What we need is a switch, shown in the schematic as component E (Figure 10.6).


FIGURE 10.7 Equivalent diagram of system with working switch.

The probability the system survives until time t can be written as P(system survives to t) = P(system survives to t|E survives)P(E survives) + P(system survives to t|E fails)P(E fails) (see the law of total probabilities in Chapter 2). The system surviving, given the switch works, is diagrammed by Figure 10.7 or two parallel blocks in series with each other. The system surviving without the switch is given by the original design. Using the diagram reduction rules, we obtain

R_s = (1 − F_A F_C)(1 − F_B F_D) R_E + [1 − (1 − R_A R_B)(1 − R_C R_D)] F_E

EXERCISE 10.8
Find the system reliability if a third backup system processor F and memory G are added and the switch E is able to couple any working processor to any working memory.
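A numerical sketch of the total-probability decomposition above (the component CDFs are made-up, and the code is not from the text):

```python
# Sketch: reliability of the backup-plus-switch system at a fixed time t.
FA, FB, FC, FD, FE = 0.05, 0.08, 0.05, 0.08, 0.01     # hypothetical CDF values
RA, RB, RC, RD, RE = 1 - FA, 1 - FB, 1 - FC, 1 - FD, 1 - FE

R_given_switch_works = (1 - FA * FC) * (1 - FB * FD)
R_given_switch_fails = 1 - (1 - RA * RB) * (1 - RC * RD)
print(R_given_switch_works * RE + R_given_switch_fails * FE)
```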

10.6  System Modeling: Minimal Paths and Minimal Cuts There are other more formal methods of drawing and decomposing system reliability diagrams that are more general and better suited for use in constructing analysis algorithms. This section will describe how to calculate system reliability using either minimal paths or minimal cuts. Consider any of the previous system diagrams. A path is defined as a sequence of components such that if all of them operate, the system operates. In other words, there is a path of operating components that allows you to go from the far left to the far right. A minimal path is a path where every component is critical to the success of the path—if any one of these components fail, that path no longer allows successful system operation (note that often the term “tie” is used instead of the term “path”). If we write down all the minimal paths we can find in a system diagram, then each will be a series block and the system operates if at least one of these blocks of paths has components that all function. This means that the series blocks of minimal paths are all in parallel, and we have already seen how to evaluate system diagrams with parallel blocks when the components in any block are independent of the components in any other block. However, path block diagrams are not actual physical system diagrams (they are a ­“logical” decomposition of ways the system can operate), and the same component can appear in more than one block. When two blocks have one or more components that are the same, they are no longer independent. This situation means we have to adjust the way we compute the probability of at least one path working to allow for the case where several paths may contain the same component. Fortunately, that adjustment is easily done as described in step 4 of the rules for calculating system reliability using minimal path analysis:










1. Label all the components with different letters and find all the minimal paths that allow the system to operate.
2. Draw a logical system diagram where all the minimal paths are series blocks and all these blocks are in parallel. The same component letter may appear in more than one block.
3. Write the CDF for each series block as 1 minus the product of the component reliabilities for the corresponding path. Use symbols like R_A, R_B, … , for each component reliability. For example, a minimal path with the components {A, C, D} will have a CDF of (1 − R_A R_C R_D) and a minimal path with components {B, C, E} will have a CDF of (1 − R_B R_C R_E).
4. Multiply all these path CDFs together and examine the result. Any term that has a component reliability raised to a power should be changed so that the exponent is 1. This adjusts for nonindependence. The result is the system CDF. For example, when the two CDFs given in step 3 are multiplied together, the result is [1 − R_A R_C R_D − R_B R_C R_E + R_A R_B R_C^2 R_D R_E], which is adjusted to [1 − R_A R_C R_D − R_B R_C R_E + R_A R_B R_C R_D R_E].
5. Now you can substitute the actual equations for the reliability of each component. Note that if two or more physically distinct components have the same reliability function, you may now have squared or higher powered terms appearing.

An example will illustrate this way of analyzing system reliability.

EXAMPLE 10.9  MINIMAL PATH ANALYSIS
Consider the system diagram, called a bridge structure, in Figure 10.8. If among the components (A, B, C, D, and E), enough are operating to go from the left to the right, the system is functional. The minimal paths are just (A, B), (A, C, E), (D, E), and (D, C, B). Each minimal path can be thought of as a block of components in series, and the whole system has all these blocks in parallel. Accordingly, the system reliability can be calculated from the system diagram in Figure 10.9. The four blocks are in parallel, and the system CDF, prior to adjustment for exponents greater than 1, is (1 − R_A R_B)(1 − R_A R_C R_E)(1 − R_D R_E)(1 − R_D R_C R_B). After multiplying this out and replacing any exponents higher than 1 with 1, the final result is

F_S = 1 − R_A R_B − R_D R_E − R_A R_C R_E − R_D R_C R_B
      + R_A R_B R_D R_E + R_A R_B R_C R_E + R_A R_C R_D R_E + R_A R_B R_C R_D + R_B R_C R_D R_E − 2R_A R_B R_C R_D R_E

Instead of looking at paths, we can look at sets of components that have to all fail in order for the system to fail. These are called cut sets, and a minimal cut set has the property that every
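The multiply-and-reduce-exponents step is tedious by hand but easy to automate. The following sympy sketch (not from the text) reproduces the bridge-structure CDF just given.

```python
# Sketch: minimal path analysis of the bridge structure with symbolic algebra.
import sympy as sp

Ra, Rb, Rc, Rd, Re = sp.symbols("Ra Rb Rc Rd Re")
path_reliabilities = [Ra*Rb, Rd*Re, Ra*Rc*Re, Rd*Rc*Rb]      # the four minimal paths

# Product of the path CDFs, expanded
Fs = sp.expand(sp.prod([1 - p for p in path_reliabilities]))

# Replace any power R**k (k > 1) by R, which adjusts for shared components
Fs = Fs.replace(lambda e: e.is_Pow and e.exp > 1, lambda e: e.base)
print(sp.expand(Fs))     # matches the F_S expression above
```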

FIGURE 10.8 Bridge structure system diagram.


FIGURE 10.9 Equivalent to bridge structure system diagram.

component in it is necessary for it to be a cut set. Each minimal cut set is a block of components that are in parallel, and the whole system can be diagrammed by putting these blocks together in series. Here are the five steps for finding the system CDF:







1. Label all the components with different letters, and find all the minimal cuts where all have to fail in order for the system to fail.
2. Draw a logical system diagram where all the minimal cuts are blocks of parallel components and all these blocks are in series. The same component letter may appear in more than one block.
3. Write the CDF for each block as the product of the component CDFs for the cut. Use symbols like R_A, R_B, … , for each component reliability. For example, a minimal cut with the components {A, C, D} will have a CDF of [(1 − R_A)(1 − R_C)(1 − R_D)]. A minimal cut with components {B, C, E} will have a CDF of [(1 − R_B)(1 − R_C)(1 − R_E)]. The reliability function for any block is 1 − the block CDF.
4. Multiply all these block reliability functions together and examine the result. Any term that has a component reliability raised to a power should be changed so that the exponent is 1. This change adjusts for nonindependence. The result is the system reliability, and 1 minus this is the system CDF.
5. Now you can substitute the actual equations for the reliability of each component. Note that if two or more distinct components have the same reliability function, you may now have squared or higher powered terms appearing.

EXAMPLE 10.10  MINIMAL CUT SET ANALYSIS
Using the same bridge structure shown in Example 10.9, the minimal cut sets are (A, D), (B, E), (A, C, E), and (D, C, B). The system can be diagrammed as given in Figure 10.10. The CDF for this diagram, prior to adjustment for terms with exponent higher than 1, is

1 − (1 − F_A F_D)(1 − F_B F_E)(1 − F_A F_C F_E)(1 − F_D F_C F_B)

The reader can verify that the final system CDF expression, after multiplying terms and adjusting for exponents higher than 1, is the same as that obtained in Example 10.9.

EXERCISE 10.9
Use minimal path analysis to solve for the CDF (or reliability function) in Examples 10.5 and 10.7. In both cases, you will find that the minimal paths have no repeated components and lead to a quick evaluation of the system CDF.

Minimal path or minimal cut analysis can even be used when there are “k-out-of-n” blocks in a system (see the beginning of Section 10.5).
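Both the path and the cut formulations can be sanity-checked by brute force, enumerating all 2⁵ component states. The reliabilities below are arbitrary, and the code is only an illustration, not part of the text.

```python
# Sketch: exact bridge-structure reliability by enumerating component states.
from itertools import product

paths = [{"A", "B"}, {"D", "E"}, {"A", "C", "E"}, {"B", "C", "D"}]
R = {"A": 0.9, "B": 0.9, "C": 0.9, "D": 0.9, "E": 0.9}   # arbitrary values

comps = sorted(R)
reliability = 0.0
for states in product([0, 1], repeat=len(comps)):
    up = {c for c, s in zip(comps, states) if s}
    prob = 1.0
    for c, s in zip(comps, states):
        prob *= R[c] if s else 1 - R[c]
    if any(p <= up for p in paths):          # at least one minimal path intact
        reliability += prob
print(reliability)
```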


FIGURE 10.10 Minimal cut analysis of bridge structure diagram.

FIGURE 10.11 Example 10.11 system diagram (components B, C, and D form a 2-out-of-3 block).

EXAMPLE 10.11  MINIMAL PATH ANALYSIS WHEN “K-OUT-OF N” BLOCKS ARE PRESENT The five-component system whose physical diagram is shown in Figure 10.11 will operate as long as at least any two of the three components B, C, and D operate and both A and E also operate. A has CDF F1; B, C, and D all have CDF F2; and E has CDF F3. Use minimal path analysis to derive an expression for the system reliability and show you get the same result if you use the “at least r out of n” formula given at the beginning of Section 10.5.

Solution The minimal paths are {A, B, C, E}, {A, B, D, E}, and {A, C, D, E}. The system reliability, prior to adjustment for exponents greater than 1, is given by



1 − (1 − R_A R_B R_C R_E)(1 − R_A R_B R_D R_E)(1 − R_A R_C R_D R_E)

When these factors are multiplied and adjusted to have no exponents greater than 1, the result simplifies to



R_A R_B R_C R_E + R_A R_B R_D R_E + R_A R_C R_D R_E − 2R_A R_B R_C R_D R_E

Now, we substitute the given CDFs or corresponding reliability functions for the components and obtain



R_S = 3R_1 R_2^2 R_3 − 2R_1 R_2^3 R_3

Next, we apply the formula for “at least r out of n” given in Section 10.5 and get the following expression for the system reliability:

R_S = R_1 \sum_{r=2}^{3} \binom{3}{r} R_2^r (1 − R_2)^{3−r} R_3 = 3R_1 R_2^2 R_3 − 2R_1 R_2^3 R_3


The two methods give the same answer, showing minimal path analysis is equivalent to a straightforward analysis using actual physical system blocks.

EXERCISE 10.10
Find the five minimal cuts for the system in Example 10.11 and use them to find the system reliability.

10.7  General Reliability Algorithms By starting with a general and flexible form for component failure rates, and using the previously discussed system models along with the defect model (Section 9.5) to build up to higher assemblies, we can reduce reliability evaluation to an algorithm that handles most applications. We assume that any system field failure can be categorized as belonging to one of the three following types: a manufacturing-defective part that escaped system test only to be “discovered” in the field; a reliability failure from a defective subpopulation; and a wear-out failure from the general population. The field use CDF would include a term for discovery of escapes and terms for as many reliability defect subpopulations as are needed. To this would be added the typical population competing failure modes. These competing distributions could have any of the shapes discussed in the chapters on the Weibull and the lognormal distributions (i.e., an early life or a constant or a wear-out shape). If we call the CDF for the general population competing wear-out failure modes FN, the complete algorithm is

FT = γFe + δFd + (1 − γ − δ)FN

where Fe is the discovery CDF for the γ proportion of defective escapes, Fd is the early life distribution for the δ proportion of reliability process defects (if there are more than one type, add as many more such terms as are needed), and FN is derived from the n normal product competing failure modes as FN = 1 − R1R2  Rn. This model is most effective if acceleration models for the parameters of Fd and FN are known. Then, the algorithm can be written to give the use CDF as a function of use conditions, making the expression very useful for a product that has varied applications. Note that the general algorithm is a mixture of the CDFs Fe, Fd, and FN, with each of these given weights that add up to 1. This mixture model corresponds to the physical assumption that all the defective subpopulations are separate from one another, with a defective component only having one type of defect and always failing eventually because of that defect. It would not be difficult to write a general algorithm where the defective subpopulation CDFs compete with one another and the normal failure CDFs using the competing risk model. This added consideration would lead to much more complicated equations, however, and little difference in failure rates for the typical cases where the weights for defective subpopulations are small. In many cases, two CDFs are sufficient to model the normal population competing failure modes. One CDF is an exponential and contributes a constant failure rate. The reliability function for this CDF will be denoted by RI. The second CDF contributes an increasing or wear-out type failure rate. This could be a suitable lognormal or Weibull distribution.


The reliability function for this CDF will be denoted by Rw. Then, FN is (by the series model) FN = 1 − RI RW . The expression for FT becomes

F_T = γF_e + δF_d + (1 − γ − δ)(1 − R_I R_W)

EXAMPLE 10.12  GENERAL RELIABILITY ALGORITHM
A dense integrated circuit module consists of an encapsulated microchip connected to a ceramic substrate. After months of life testing and modeling, distributions for the significant failure mechanisms at use conditions are known. There is a constant failure rate of 100 PPM/K and a lognormal wear-out distribution with σ = 0.8 and T50 = 975,000. In addition, it is estimated that about 0.2% of the modules have entrapped contaminants that lead to corrosion failure according to a lognormal distribution with σ = 0.8 and T50 = 2700. If the test coverage and efficiency allows 0.1% defective modules to be shipped, and the assumed discovery model is a Weibull with m = 0.5 and c = 400, give the failure rate curve for the module.

Solution
All the parameters and distributions for the general reliability algorithm have been specified. The value of γ is 0.001, and δ = 0.002. The use CDF and PDF are calculated from

F_T = 0.001F_e + 0.002F_d + 0.997(1 − R_I R_W)

f_T = 0.001f_e + 0.002f_d + 0.997(f_I R_W + f_W R_I)

where F_e and f_e are the discovery Weibull CDF and PDF, F_d and f_d are the lognormal CDF and PDF for the reliability defects, R_I and f_I are the reliability function and PDF for the exponential (constant failure rate) failure mode, and R_W and f_W are the reliability function and PDF for the wear-out lognormal. A graph of the failure rate h_T(t) = f_T(t)/(1 − F_T(t)) is shown in Figure 10.12. Note the bathtub shape with a steep front end due to the discovery of escapes and reliability defect failures. The failure

FIGURE 10.12 General reliability algorithm failure rate example (failure rate in PPM/K versus power-on hours, 0 to 100 K).


rate settles down to a little over the constant adder of 100 PPM/K after about 15,000 hours. Wear out starts being noticeable by 50,000 or 60,000 hours, causing the failure rate to rise to nearly 180 PPM/K at 100,000 hours.

EXERCISE 10.11
Plot a graph of the failure rate curve (0 to 100 K hours) under the following conditions: RI is exponential with λ = 50 FITs; RW is Weibull with β = 2 and α = 500,000 hours; Fe is exponential with MTTF = 1000 hours; γ = .5%; δ = 1%; and Fd is lognormal with σ = .4 and T50 = 700 hours.
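For readers who prefer a scripted check of Example 10.12 (or a starting point for Exercise 10.11), here is a Python sketch. It is not the text's approach; it assumes 100 PPM/K corresponds to 1 × 10⁻⁷ failures per hour and uses scipy's lognormal and Weibull parameterizations (scale = T50 or characteristic life).

```python
# Sketch: failure rate h_T(t) for the mixture model of Example 10.12.
import math
from scipy.stats import lognorm, weibull_min

gam, delta = 0.001, 0.002
lam = 1e-7                                                   # 100 PPM/K hour
Fe = lambda t: weibull_min.cdf(t, 0.5, scale=400.0)          # escape discovery
fe = lambda t: weibull_min.pdf(t, 0.5, scale=400.0)
Fd = lambda t: lognorm.cdf(t, 0.8, scale=2700.0)             # reliability defects
fd = lambda t: lognorm.pdf(t, 0.8, scale=2700.0)
RI = lambda t: math.exp(-lam * t)                            # constant-rate mode
fI = lambda t: lam * math.exp(-lam * t)
Rw = lambda t: 1 - lognorm.cdf(t, 0.8, scale=975_000.0)      # wear-out mode
fw = lambda t: lognorm.pdf(t, 0.8, scale=975_000.0)

def hT(t):
    FT = gam * Fe(t) + delta * Fd(t) + (1 - gam - delta) * (1 - RI(t) * Rw(t))
    fT = gam * fe(t) + delta * fd(t) + (1 - gam - delta) * (fI(t) * Rw(t) + fw(t) * RI(t))
    return fT / (1 - FT)

for t in (1_000.0, 20_000.0, 100_000.0):
    print(t, hT(t) * 1e9, "FITs")          # bathtub shape as in Figure 10.12
```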

10.8  Burn-In Models

If the front end of a failure rate curve, such as the one shown in Example 10.12, is too high to be acceptable, and it is not possible to make significant improvements in the manufacturing process, there are two options available to improve the situation. By increasing test coverage and efficiency, the escape discovery portion of the front end can be reduced. Also, by stressing, or burning in, the product prior to shipment, the reliability defect portion of the front-end failure rate can be virtually eliminated. This section discusses how to use our general reliability algorithm to do a mathematical analysis of the possible benefits of a burn-in. Suppose the burn-in is for T hours at much higher than normal use stress. We need factors that can convert high stress test hours to use condition hours, for every failure mode (apart from defect escape discovery) in the general algorithm. These acceleration factors are obtained using methods such as those described in Chapters 8 and 9. Assume that the factor for the reliability defect mode is Ad, and the factors for the constant and wear-out modes are AI and Aw, respectively. If there are multiple mechanisms (and CDFs) for any of these types, more factors are necessary. Now we have to decide how efficient the burn-in testing is at catching all the failures produced by the burn-in stress. Are all these failures detected and removed prior to shipment? Or, do some escape to be discovered by the end user along with the other manufacturing escapes? Are any previously undetected manufacturing escapes caught by the burn-in testing? For simplicity, we assume that none of the γ manufacturing escapes are found and removed at the burn-in. Let the burn-in test efficiency be (1 − Be). This means that burn-in testing catches (1 − Be) of all the early life failures that happen during the burn-in period. Be is the fraction of these failures that escape and merge with the other γ manufacturing defect escapes. The fallout and escapes from burn-in are

Fallout = (1 − B_e)\{δF_d(A_d T) + (1 − γ − δ)[1 − R_I(A_I T) R_w(A_w T)]\}

New escapes = η = B_e\{δF_d(A_d T) + (1 − γ − δ)[1 − R_I(A_I T) R_w(A_w T)]\}

Total escapes to field = (γ + η)/(1 − Fallout)

System Failure Modeling: Bottom-Up Approach

363

The (1 − fallout) term appears in the denominator because the population actually shipped has been reduced, and the defects are now a higher proportion. Similar correction terms appear in the expression for the field CDF after burn-in. Let γ =

γ+η 1 − Fallout

δ =δ

1 − Fd ( AdT ) 1 − Fallout

Then, the CDF is FBI (t) = γ Fe (t) + +

δ [ Fd (t + AdT ) − Fd ( AdT )] 1 − Fd ( AdT ) (1 − γ − δ ) [ RI ( AIT )Rw ( AwT ) − RI (t + AIT )Rw (t + AwT )] RI ( AIT )Rw ( AwT )

where the correction terms that appear in the denominators can also be viewed as making the probabilities of failure or survival conditional upon surviving the burn-in. Failure rates are obtained by taking the derivative of the after burn-in CDF FBI(t), and then calculating h(t): fBI (t) = γ fe (t) +

δ fd (t + AdT ) (1 − γ − δ ) f w (t + AwT )RI (t + AIT ) + fI (t + AIT )Rw (t + AwT ) + Rd ( AdT ) RI ( AIT )Rw ( AwT )

hBI (t) =

fBI (t) 1 − FBI (t)
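These bookkeeping formulas translate directly into code. The sketch below (not from the text) takes the component distribution functions and acceleration factors as inputs and returns the burn-in fallout, the adjusted defect fractions, and the after-burn-in failure rate function.

```python
# Sketch: burn-in fallout, adjusted defect fractions, and h_BI(t).
def burn_in_model(T, Be, gam, delta, Ad, AI, Aw, Fd, fd, RI, fI, Rw, fw):
    early = delta * Fd(Ad * T) + (1 - gam - delta) * (1 - RI(AI * T) * Rw(Aw * T))
    fallout = (1 - Be) * early
    eta = Be * early                                   # burn-in escapes
    gam_p = (gam + eta) / (1 - fallout)
    del_p = delta * (1 - Fd(Ad * T)) / (1 - fallout)
    surv = RI(AI * T) * Rw(Aw * T)                     # normal-mode burn-in survival

    def h_BI(t, Fe, fe):
        FBI = (gam_p * Fe(t)
               + del_p * (Fd(t + Ad * T) - Fd(Ad * T)) / (1 - Fd(Ad * T))
               + (1 - gam_p - del_p) * (surv - RI(t + AI * T) * Rw(t + Aw * T)) / surv)
        fBI = (gam_p * fe(t)
               + del_p * fd(t + Ad * T) / (1 - Fd(Ad * T))
               + (1 - gam_p - del_p)
                 * (fw(t + Aw * T) * RI(t + AI * T) + fI(t + AI * T) * Rw(t + Aw * T)) / surv)
        return fBI / (1 - FBI)

    return fallout, gam_p, del_p, h_BI
```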

EXAMPLE 10.13  BURN-IN MODEL The reliability engineers responsible for the component described in Example 10.12 decide to try a 9-hour burn-in to improve the front end of the failure rate curve shown in Figure 10.12. The three failure modes described by the CDFs Fd, FI, and Fw are accelerated by temperature according to an Arrhenius model, with ΔHs of 1.15, 0.5, and 0.95, respectively. The field use temperature is 65°C, and the proposed burn-in temperature is 145°C. Assuming perfect efficiency at catching failures generated by the burn-in, what will the expected burn-in fallout be? How will the new failure rate curve for burned-in product compare to the old curve? What are the old and new average failure rate, AFR(10,000), values?

Solution
Using the methods described in Section 8.8, we use the formula = EXP(ΔH × 11605 × (T2 − T1)/(T2 × T1)) to calculate acceleration factors of Ad = 2000, AI = 27, and Aw = 500. The fallout at burn-in is

Fallout = 0.002F_d(9 × 2000) + 0.997[1 − R_I(9 × 27) R_w(9 × 500)] = 0.0020065

by substituting the proper CDFs from Example 10.12.


FIGURE 10.13 Failure rate before and after burn-in (no burn-in versus with burn-in; PPM/K versus power-on hours).

The after burn-in CDF and PDF are

F_{BI}(t) = 0.001002F_e(t) + \frac{0.00001776[F_d(t + 18000) − 0.9911]}{0.008860} + 0.9990[0.9999757 − R_I(t + 243) R_w(t + 4500)]

f_{BI}(t) = 0.001002f_e(t) + 0.002004f_d(t + 18000) + 0.9990[f_w(t + 4500) R_I(t + 243) + f_I(t + 243) R_w(t + 4500)]

+ 0.9990 [ 0.9999757 − RI (t + 243)Rw (t + 4500)] fBI (t ) = 0.001002fe (t ) + 0.002004fd (t + 18000) + 0.9990 [ fw (t + 4500)RI (t + 243) + fI (t + 243)Rw (t + 4500)] Figure 10.13 shows how hBI(t) compares to the old h(t). The burn-in has made a significant improvement in the front-end failure rate, with little effect on the values after about 18,000 hours. The new AFR(10,000) is 201 PPM/K or FIT compared to the non–burn-in value of 390 PPM/K. This value was calculated from AFR(T) = 109 × (−lnR(T))/T.

In the last example, the theoretical analysis showed that a short burn-in would be beneficial. However, burn-in may not always be so effective since burn-in affects every failure mode differently. Constant failure rate modes are not improved at all, while wear-out modes are made worse. Only the early life failure modes, or those due to a reliability defective subpopulation, are affected; for those to be influenced significantly by a short burn-in, there must be a high acceleration factor. EXERCISE 10.12 Find the effect on the AFR from 0 to 1 K and 10 K hours of a 24-hour burn-in on the component described in Example 10.12. Assume the same acceleration models, burn-in, and field conditions as in Example 10.13. Plot both the new and old failure rate curves.

The most economical and effective way to improve early life performance is to improve the manufacturing process. With a dedicated return and analysis program of field failures, the major contributors to early life failure can be analyzed to identify their root causes and


corrective actions implemented during the manufacturing process. This will improve both yield and reliability. The next most efficient way to improve early life performance is by more effective tests, inspections, and screens. This can be combined with a statistical process control (SPC) program. In addition, there is the burn-in or last-minute stress screening option. This fix can be thought of as a process Band-Aid—it is sometimes necessary no matter how much we would like to avoid it. The burn-in model of this section can be a useful tool, helping us to choose the best stress level and time for burn-in, or in some cases, warning us that burn-in is not likely to produce the results desired. Careful data collection and analysis of both burn-in fallout and later life performance should be part of any burn-in implementation plan. This is especially true when using stresses of the “shake, rattle, and roll” type, for which no known acceleration model may exist. Some useful references on burn-in are Peck and Trapp (1980) and Jensen and Petersen (1982). The authors call the defect subpopulation failures “freaks” and the escapes “infant mortality.” They also point out that it is common to introduce infant mortality defects during the manufacturing operations that occur after a component-level burn-in. These added defects may make the burn-in appear less effective.

10.9  The “Black Box” Approach: An Alternative to Bottom-Up Methods The approach described so far has been to model each failure mechanism within each component and use the competing risk model and the series and the parallel models (or the more complex models and algorithms described in the preceding sections) to build up to system-level failure rates and reliabilities. This approach is highly recommended for several technical reasons: acceleration models make physical sense only when applied at the individual failure mechanism level, and this situation is also the natural place to model a homogeneous population of failure times using one of the lifetime distribution models discussed. However, there are often compelling practical reasons to eschew a bottom-up approach. Frequently, a company purchases a complete assembly of components and uses this assembly within a box or system it manufactures and sells. It may be difficult or impossible to obtain failure mode–level, or even component-level, reliability information from the supplier of this assembly. It is a common practice to call this assembly a “black box” and test it as if it were a single component within the system. If the test is done at high laboratory stresses, a single (best guess or average) acceleration factor is used to translate the results to typical use conditions. A flexible life distribution such as the Weibull or the lognormal is then used to calculate failure rates and the results are added to the other system components to obtain a total system failure rate projection. This procedure saves the time and expense of either testing many unfamiliar components or doing extensive literature searches for reliability data on these components (with no guarantee of finding appropriate estimates for the actual system use conditions). For these practical reasons, black box testing will continue to be a common practice regardless of theoretical difficulties. It is important, however, to be aware of situations that can lead to inaccurate or misleading results. The following examples will show some of these situations and allow us to formulate guidelines to minimize the dangers inherent in black box testing.


EXAMPLE 10.14  BLACK BOX TESTING I A subassembly contains three components in a series. Assume that the true Arrhenius ΔH for component 1 is 0.6, for component 2 is 0.9, and for component 3 is 1.2. Also, assume that the true (but unknown) use failure rate for all three components is a constant 0.01%/K hour. It is proposed to test the assembly as if it were a single component with a ΔH of 0.9 (the actual average ΔH). Testing will be done at 125°C. What is the actual population use and test failure rate for this assembly? If the test results yield a perfect stress failure rate estimate (exactly equal to the theoretical test failure rate), what will the test engineers predict the use condition failure rate to be?

Solution The true assembly failure rate is the sum of the failure rates of the three components, or 0.03%/K hour. The true acceleration factors based on the actual Arrhenius ΔH constants are 30, 167, and 1624 for components 1, 2, and 3, respectively (see Chapter 7 for details on acceleration factor calculations with the Arrhenius model). That means the (theoretical) test failure rate is 0.30 + 1.67 + 16.24%/K hour = 18.21%/K hour. If the test failure rate estimate actually is 18.21%/K hour, however, use of a single ΔH of 0.9 would yield a use failure rate estimate of 18.21/167 = 0.11%/K hour. This would be an overestimate of the use failure rate by a factor of almost four times, although the test results were as accurate as possible. EXAMPLE 10.15  BLACK BOX TESTING II Assume the same situation as in the previous example except for the actual use failure rates of the three components. Instead of each having a 0.01%/K hour failure rate, component 1 has a 0.1%/K hour failure rate, while components 2 and 3 both have 0.001%/K hour failure rates. What are the true use and stress failure rates and what will be the test engineers’ estimate of the use failure rate, again assuming they obtain a perfect estimate of the stress failure rate?

Solution The use failure rate is 0.1 + 0.001 + 0.001 = 0.102%/K hour. The stress failure rate is 3 + 0.167 + 1.624 = 4.791%/K hour. This time the use failure rate would be estimated by 4.791/167 = 0.03%/K hour. This understates the true use failure rate by a factor of more than three times.
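The distortion caused by a single “average” acceleration factor is easy to demonstrate; this sketch (not from the text) simply repeats the arithmetic of Examples 10.14 and 10.15.

```python
# Sketch: black-box projection with a single assumed acceleration factor.
def black_box_projection(use_rates, true_factors, assumed_factor):
    true_use = sum(use_rates)                                    # %/K hour
    stress_rate = sum(r * a for r, a in zip(use_rates, true_factors))
    projected_use = stress_rate / assumed_factor
    return true_use, stress_rate, projected_use

print(black_box_projection([0.01, 0.01, 0.01], [30, 167, 1624], 167))   # Example 10.14
print(black_box_projection([0.1, 0.001, 0.001], [30, 167, 1624], 167))  # Example 10.15
```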

The two black box testing examples use the correct average ΔH, and there was no sampling error in the estimate of the stress failure rate of the assembly. Yet, in the first case, we overestimated the true use failure rate by four times, and in the second case, we underestimated the true use failure rate by more than three times. Obviously, a linear average of ΔHs leads to problems since the corresponding acceleration factors are not linear. Similar examples could easily be constructed using any other kind of average ΔH or average acceleration factor. On the other hand, if the failures occurring in the first example were analyzed down to type of component and failure mode, it would become obvious that most of the failures were due to component 3 (about 16 out of 18, or almost 90% of the failures would typically be due to component 3). The appropriate acceleration factor for those failures is 1624, much higher than the factor of 167 used in the calculation. It’s no wonder the failure rate was overestimated. In the second example, more than 62% of the failures would be from component 1. These are only accelerated by a factor of 30 and are treated optimistically by the use of an acceleration factor of 167. These results explain why the use failure rate obtained in the example was an underestimate. So when are we likely to have trouble using the black box approach? Clearly, when the average acceleration factor used in the analysis is inappropriate for a large number of the failures that occur during the stress test, we stand a good chance of running into difficulty.


We might also come to incorrect conclusions if a mechanism that contributes a significant failure rate at use conditions has a very low ΔH. The sample size at stress might not be high enough to see even one of these failures; yet, they might be the dominant cause of failure at use conditions. We can put together these observations into a set of general guidelines for black box testing:





1. Assess in advance the failure mechanisms likely to cause use condition failures. They should have approximately equal ΔHs or the approach is likely to be invalid. Use this ΔH for planning the test assuming an exponential model (see Chapter 3). 2. After the test is over, verify by failure analysis that the failure mechanisms were as expected. If other mechanisms are observed that might have widely varying acceleration factors, ­follow-up experiments using a bottom-up approach are called for. 3. If the possibility of components in the assembly having failure mechanisms with very low acceleration factors cannot be ruled out, extreme overstressing of a few assemblies is recommended. Many failures will occur, some of which are extraneous results of the overstressing, and others that are relevant. Analyze these failures to gain confidence that nothing has been overlooked. This kind of testing is sometimes referred to as “elephant testing.” The derivation of this term comes from picturing the assemblies placed one by one in the path of a walking elephant. It may be impossible to project a use failure rate from the results of this kind of testing, but the weak points of the assembly are certainly exposed.

10.10  Summary This chapter introduced the series and competing risk models and showed several ways of evaluating system reliability for complex systems with parallel and series components and other configurations. The escape discovery model and the defect model were used to formulate a general algorithm for component failure rates. If we have stress dependency incorporated into this general algorithm, we can also make trade-off analysis calculations concerning burn-in and test efficiency changes. We also saw how system redundancy provides enormous improvement in early life failure rates. As component costs go down, and high reliability becomes more essential, this design option will apply more and more. While a bottom-up approach to reliability analysis and failure rate projection is generally recommended, we also presented guidelines for black box testing.

Problems 10.1 A VCR has most of its components on one board. The board has 12 A type units, 15 B type units, and 6 C type units. All components operate independently. The first failure of any component causes the VCR to fail. Type A modules have an exponential failure rate with the MTTF = 300,000 hours. The B modules follow a


lognormal distribution with T50 = 30,000 hours and sigma = 1.3. The C modules have three significant failure mechanisms. All are described by Weibull distributions with shape parameters equal to 0.8 and characteristic lives of 500,000, 800,000, and 1,100,000 hours, respectively. What is the AFR of the VCR during the first 1,000 hours? What is the AFR during the next 2,000 hours? What is the chance that the VCR has no failures during the first 500 hours? 10.2 Certain radar systems have reliability described by the exponential distribution with MTTF = 600 hours. To assure better reliability, three identical systems are configured such that one system is operating while the other two are in standby mode. The failure of the first system causes an immediate switch to one of the remaining two. If the second one fails, the third in standby mode is also immediately switched on. Assuming a perfect switching device that never fails, find the PDF for the system lifetime. What is the system MTTF? What fraction of systems survive 2000 hours? Now assume that the switch has a probability of .001 of not working any given time it is used. What are the system PDF, MTTF, and survival probability at 2000 hours? (Hint: With probability of .9992, the system lifetime is the sum of three exponential lifetimes, with probability of .001 × .999, the system lifetime is the sum of two exponential lifetimes, and with probability of .001, the system lifetime ends after the first failure. Answer the questions for each case or possible branch and use these probabilities as weights to get the complete answer.) 10.3 A system consists of seven units: A, B, C, D, E, G, and H. For the system to function, unit A and either unit B or C and either D and E together or G and H together must work. Draw the diagram for this setup, and write the equation for the CDF of the system in terms of the individual component CDFs. 10.4 You have been asked to review the field failure rate projection made for a complex electronic board your company purchases from a vendor. You are told that 50 boards were tested at high stress for 2000 hours, with six failures observed (failed boards were immediately repaired and put back on test). Projections were made assuming an acceleration factor of 1000, which you are told is believed to be a conservative overall factor for the many different types of components on the board. The analysis was straightforward: six failures from 100,000,000 use equivalent board hours give a failure rate estimate of 60 FITs, with a 90% upper bound of 106 FITs.   What further questions would you ask in order to validate this projection? What answers would lead you to accept the projection? What answers would lead you to recommend additional testing? What risks would you point out are inherent with this kind of testing and analysis? Can you think of other recommendations you might make that would give early warning if the test had failed to disclose a possibly serious field reliability exposure?

11 Quality Control in Reliability: Applications of Discrete Distributions Manufacturers often conduct various tests on samples from individual lots to infer the expected level of reliability of a product in customer applications. In addition, the stressing of consecutive groups of samples from production may be employed to monitor the reliability of a running process. Such results may be plotted on statistical control charts to verify that the manufacturing process is “under control.” However, as such studies can be costly and time consuming—and often destructive to units—it is important that efficient sampling designs be selected to provide the necessary information while using the minimum quantities of product. This chapter covers the implementation of various types of sampling plans for attribute data, the associated risks, the operating characteristic (OC) curves, and the choice of sample sizes. We discuss the applications of various discrete distributions, namely the binomial, geometric, negative binomial, Poisson, and hypergeometric. The calculation of confidence limits for distribution parameters is treated. We introduce Fisher’s exact test for comparing proportions when sample sizes are limited. Finally, we look briefly at the application of statistical process control charting for reliability. We illustrate how spreadsheets can be used effectively to perform calculations and thereby provide solutions to many problems related to discrete distributions.

11.1  Sampling Plan Distributions There are several important considerations that an engineer must keep in mind when choosing appropriate sampling plans. First, one must establish the scope of the inference: Are the results to be used to draw conclusions about an individual lot or about an ongoing process? If the former, then the size of the sample relative to the lot size must be considered. Obviously, if the sample size is a significant portion of the lot size, say over 10%, then there is more information about the quality of the remainder of the lot than there would be in sampling from a process capable of producing an infinite number of units. Indeed, different probability distributions are required to treat each case: the hypergeometric distribution applies when lot size must be considered; the binomial distribution holds for inference to a process (or a very large lot). The binomial distribution has many important applications that will be described in detail. However, to understand and derive the binomial distribution, we digress for a moment and discuss permutations and combinations.


11.1.1  Permutations and Combinations A permutation is an arrangement of objects in which order is important. For example, suppose we have three seating positions and three people: call them A, B, and C. We have three choices for placement in the first position: A, B, or C. Given occupancy of position 1, there remains two possible choices for position 2. That is, if B is in position 1, only A and C are left to sit in position 2. When both the first and the second positions are occupied, only one choice remains for the third seat. A quick way of determining the number of potential arrangements uses the following basic principle of counting: if an operation has x ­possible outcomes and each of these outcomes can be followed by an operation consisting of y possible outcomes, then the total number of outcomes for the successive operations is xy. Extension to three or more operations is obvious. See Ross (2002) for further treatment on this topic. By this product rule, there are a total of 3 × 2 × 1 = 6 possible seating arrangements. We use n!, called “n factorial,” to denote the descending product of numbers n(n − 1)(n − 2) 1. Thus, 3 × 2 × 1 = 3!. By convention, 0! = 1. We list the possible arrangements of the three objects as follows: Possible Arrangements of Three Objects ABC ACB BAC BCA CAB CBA

Each arrangement is called a permutation. Thus, the number of permutations of n objects is n!. Suppose we have eight objects and we wish to determine the number of ordered arrangements we can form using four objects at a time. Consider four available positions. The first position may be occupied by any of eight objects, the second by any of the remaining seven objects, the third by any of the six, and the fourth by any of the five left. Thus, the total number of permutations is 8 × 7 × 6 × 5 = 1680. We would not list these, but the general formula for the number of permutations of n objects taken r at a time is

{}_nP_r = n(n − 1)(n − 2) \cdots (n − r + 1)

Note that there are r separate terms in the product above. Suppose we have eight objects and we want to select four at a time, but we do not care about the order of the four selected. Such would be the case, for example, if we were forming teams consisting of four players each from eight possible choices. Such selections in which order is immaterial are called combinations. To determine the number of combinations, one simply divides the number of permutations {}_8P_4 by 4!, the number of permutations of four objects at a time. Thus, the number of combinations of eight objects taken four at a time is

{}_8C_4 = \frac{8 × 7 × 6 × 5}{4 × 3 × 2 × 1} = 70


In general, the expression for the number of combinations can be multiplied by 1 = [(n − r)!/(n − r)!] to give a simpler expression:

{}_nC_r = \frac{{}_nP_r}{r!} = \frac{n(n − 1)(n − 2) \cdots (n − r + 1)}{r!} = \frac{n(n − 1)(n − 2) \cdots (n − r + 1)(n − r)(n − r − 1) \cdots 1}{[(n − r)(n − r − 1) \cdots 1]\,r!} = \frac{n!}{(n − r)!\,r!} = \binom{n}{r}

where \binom{n}{r} is the special symbol used to denote the number of combinations of n objects taken r at a time.

11.1.2  Permutations and Combinations via Spreadsheet Functions Both Excel and OpenOffice have routines to easily determine permutations and combinations. The functions are PERMUT and COMBIN. The two arguments in each function are n and r. In Excel, select Insert, Function, and choose the Statistical category in the drop-down menu. Scroll to PERMUT and click OK. Enter n for Number and r for Number_chosen, followed by OK. As an example, PERMUT(10,3) returns the result 720. In OpenOffice and Excel, the procedure for COMBIN is similar except that for some unknown reason, the function COMBIN is listed under the Mathematical category. Thus, in OpenOffice, COMBIN(10;3) gives the result 120. EXERCISE 11.1 How many different 11-letter string can be formed from the letters in the word MISSISSIPPI? Hint: Consider the number of permutations if all 11 letters were unique. Then, divide by the permutations of common letters. EXERCISE 11.2 Let S denote a survival and F a failure. Given two failures among five items on stress, how many different ways can the two failures occur on the five objects? Hint: One way is SSSFF. EXERCISE 11.3 How many different ways can r failures occur among n units on stress? (Note that time to failure is not the issue here. We are allocating the designation “failure” in any order among n stressed units.) Hint: See the previous exercise and note that this problem is analogous to the forming of n letter words from r F’s and (n−r) S’s.
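For readers scripting these calculations rather than using a spreadsheet, Python's standard library (3.8 and later) has direct equivalents; this aside is not part of the text.

```python
# Sketch: library equivalents of the PERMUT and COMBIN spreadsheet functions.
from math import perm, comb

print(perm(10, 3))   # 720, matching PERMUT(10,3)
print(comb(10, 3))   # 120, matching COMBIN(10;3)
```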


11.1.3  The Binomial Distribution The exponential, Weibull, normal, and lognormal distributions are examples of continuous distributions. However, many problems involve data in a discrete form. For example, we consider the following two situations:

1. The probability of failure of a component at time t is 0.1. What is the probability of at least one failure among 10 similar units placed on stress and run for time t?
2. What is the probability of exactly no failures, or 1, 2, … , or 10 failures by time t?

Such questions usually can be handled via the binomial distribution. Four conditions are necessary for the binomial distribution to apply:

1. Only two outcomes are possible (e.g., success and failure).
2. There is a fixed number (n) of trials.
3. There exists a fixed probability, p, of success from trial to trial.
4. The outcomes are independent from trial to trial.

In general, for reliability work, the first two conditions are met; the third is assumed to hold, at least approximately; and the fourth is usually applicable. For example, an engineer stresses a fixed number of units (Condition 2) for a desired time. Each unit will either survive or fail the test (Condition 1), and the failure of one unit does not affect the probability of failure of the others (Condition 4). We also assume all the units have been obtained from the same population and thus have the same probability of failure (Condition 3) at any point in time as given by the CDF value p = F(t). The general expression for the binomial distribution, which gives the probability of exactly x failures in n trials with probability of failure p per trial, is

P(X = x) = \binom{n}{x} p^x (1 - p)^{n - x},   x = 0, 1, 2, …, n

It is illustrative of the application of the probability rules covered in the first two chapters to develop the binomial distribution directly. Consider a component with known probability of failure by time t given by p = F(t), where F(t) is the CDF value (note in general that p can represent success or failure probabilities because only two outcomes are possible: one with probability p and the other with probability q = 1 − p). Suppose we have n such units and we run them on stress to t hours. One possible result is that at the end of the experiment there are no failures. The probability of a single unit surviving is 1 − p. The probability that all n independent units survive is given by the product rule as

P(X = 0) = (1 - p)(1 - p) \cdots (1 - p)   (n factors) = (1 - p)^n

Consider the case of a single failure. A possible sequence (where F = failure and S = survival) might be

FSSS ⋯ SSS

that is, the unit in position one fails and the remaining n − 1 units survive. The probability of this specific sequence is


p(1 - p)(1 - p) \cdots (1 - p) = p(1 - p)^{n-1}

Normally, we are not concerned with which specific unit in a group of units fails; we only care that one unit did not survive. Thus, we are interested in how many ways any one part can fail out of n items, where order does not matter. As Exercise 11.3 shows, the answer is given by the combination formula {}_nC_1 = n. As all sequences with one failure and n − 1 survivors have the same probability and there are n such mutually exclusive sequences, then by the union rule for probabilities, the probability of exactly one fail is

P(X = 1) = np(1 - p)^{n-1}

Similarly, the probability of a given sequence in which two failures occur and n − 2 survive is given by

p^2 (1 - p)^{n-2}

As there are {}_nC_2 = n(n − 1)/2 different ways of two failures occurring among n units where order is immaterial, the probability of exactly two failures in n items is

P(X = 2) = \frac{n(n-1)}{2}\, p^2 (1 - p)^{n-2}

Continuing this way, we see in general that the probability of getting exactly x failures from n items on stress is

P(X = x) = \binom{n}{x} p^x (1 - p)^{n - x}

This expression is also called the probability mass function or binomial pmf. We will often use the shorthand notation P(x) for P(X = x). For the binomial distribution, the mean or expected number of failures is np; the variance is np(1 − p); the standard deviation is \sqrt{np(1-p)}.

EXAMPLE 11.1  BINOMIAL CALCULATIONS
100 lightbulbs will be stressed for 1000 hours. From previous experimental work, the probability of a bulb failing by 1000 hours is .01 or 1%. Assuming the bulbs on stress are from the same population,

a. What is the probability that all bulbs survive 1000 hours?
b. What is the probability of exactly one bulb failing?
c. What is the probability of at least one bulb failing?



Solution

a. (1 − p)^n = (1.00 − 0.01)^{100} = 0.99^{100} = 0.366. Therefore, approximately one chance out of three exists that no bulbs will fail by 1000 hours.
b. np(1 − p)^{n−1} = 100(0.01)(0.99)^{99} = 0.370
c. We could calculate the individual probabilities of 1, 2, 3, and so on, failures and add these together to get the answer. However, a much simpler procedure recognizes the fact that the probability of at least 1 failure equals 1 minus the probability of no failures. Thus,

P(X > 0) = P(X ≥ 1) = 1 − P(X = 0) = 1 − 0.366 = 0.634
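The calculations in Example 11.1 can also be reproduced outside a spreadsheet. The following minimal Python sketch (not part of the original text) evaluates the same three binomial probabilities directly from the pmf.

from math import comb

n, p = 100, 0.01

p0 = (1 - p) ** n                          # P(X = 0): all bulbs survive
p1 = comb(n, 1) * p * (1 - p) ** (n - 1)   # P(X = 1): exactly one failure
p_at_least_one = 1 - p0                    # P(X >= 1): at least one failure

print(round(p0, 3), round(p1, 3), round(p_at_least_one, 3))  # 0.366 0.370 0.634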


In roughly two out of three such experiments, we would expect at least one failure, but only about one-third of the time will there be exactly one failure. EXERCISE 11.4 50 devices are placed on stress for 168 hours. The probability of a device failing by 168 hours is .05 or 5%. (a) What is the probability that all devices survive 168 hours? (b) What is the expected number of failures? (c) What is the probability of at least one failure? EXAMPLE 11.2  BINOMIAL PMF Figure 11.1 shows a plot of the binomial probability mass function for the case of 20 units sampled from a population with individual probability of failure at 100 hours of .20. Recall the binomial distribution has an expected number of failures np and variance np(1 − p). Here, the expected number is 20 × .20 = 4 and the variance is 20 × .20 × .80 = 3.2. Note the mean np corresponds to the peak in the binomial pmf. EXERCISE 11.5 Sketch out the binomial pmf for the situation where n = 10 and p = .05. Repeat for n = 10 and p = .5. Estimate the mean and variance for each case. Comment on the appearance of the pmf’s.

11.1.4  Cumulative Binomial Distribution
Often, we are interested in the probability of several possible events occurring. For example, we might accept a lot for shipment if there are three or fewer rejects in the drawn sample of size n. To get the probability of x or fewer events, we sum the individual binomial probabilities. Thus,

P(X \le x) = P(0) + P(1) + \cdots + P(x) = \sum_{k=0}^{x} P(k)

FIGURE 11.1
Binomial distribution: probability of x failures versus x = number of failures, for n = 20, p = .20.


Note that, by the discrete nature of events,

P(X > x) = P(X \ge x + 1) = \sum_{k=x+1}^{n} P(k)

and

P(X \le x) + P(X > x) = 1

Consequently,

P(\text{at least one}) = P(X \ge 1) = 1 - P(0)

Figure 11.2 shows a plot of the cumulative distribution function (CDF) for the binomial distribution for the case of 20 units sampled from a population with individual probability of failure at 100 hours given by .20. EXERCISE 11.6 Sketch out the binomial CDF for the situation where n = 10 and p = .05. Repeat for n = 10 and p = .5. Comment on the appearance of the CDFs. EXERCISE 11.7 A baseball player has a current batting average of .250. In a game, he will bat four times. What is his expected number of hits? What is the standard deviation? What is the probability of no hits, exactly one hit, four hits, and at least one hit?

FIGURE 11.2
CDF for the binomial distribution: probability of x or fewer failures versus x = number of failures, for n = 20, p = .20.

11.1.5  Spreadsheet Function for the Binomial Distribution
BINOMDIST is the worksheet function for the binomial distribution in both Excel and OpenOffice. The function has four arguments: x, n, p, and a logical value (0,1 or true or false) depending on whether the pmf or cumulative probabilities are desired. For example,


the probability of exactly x = 3 failures out of n = 5 items on test with an item failure probability of p = .1 is given by BINOMDIST(3,5,.1,0), which returns .0081. The probability of 3 or fewer failures for the same situation is provided by BINOMDIST(3,5,.1,1), which results in .99954.

11.1.6  Relation of Binomial Distribution to Beta Distribution
In Chapters 4 and 5, we discussed several relationships among continuous distributions, for example, the exponential to the Weibull, or the normal to the lognormal. However, relationships also exist between continuous and discrete distributions. One that is useful is the association between the CDFs of the binomial distribution and the beta distribution. The beta distribution is described in Chapter 1, Example 1.8. Let P be a random variable defined over the interval 0 ≤ x ≤ 1 and having a beta distribution with integer parameters a > 0 and b > 0. The CDF of the beta distribution (commonly called the incomplete beta function) in terms of the variable p can be expressed as



F(p) = \int_0^p \frac{x^{a-1}(1 - x)^{b-1}}{B(a, b)}\, dx = I_p(a, b)

where B(a,b) is evaluated using gamma functions as

B(a, b) = \frac{\Gamma(a)\Gamma(b)}{\Gamma(a + b)}

To show the relationship between the binomial distribution and the beta, we define new integer parameters r = a and b = n − r + 1, allowing us to express the incomplete beta function as

I_p(r, n - r + 1) = \frac{n!}{(r - 1)!\,(n - r)!} \int_0^p x^{r-1}(1 - x)^{n-r}\, dx

Integration by parts can show that the binomial distribution is related to the beta distribution via the following formula:

I_p(r, n - r + 1) = \sum_{k=r}^{n} \binom{n}{k} p^k (1 - p)^{n - k}

Alternatively, we can rewrite the relation between the beta and binomial CDFs in the following form:

1 - I_p(r, n - r + 1) = \sum_{k=0}^{r-1} \binom{n}{k} p^k (1 - p)^{n - k}

We will shortly see how these useful relations can be applied.


11.2  Nonparametric Estimates Used with the Binomial Distribution Suppose we have insufficient data to determine the underlying distribution for a component, but we know from previous studies that about 95% of units survive at 100 hours. Such an estimate is called nonparametric because no specific, continuous distribution of the failure times is assumed. We are planning a 100-hour mission requiring 20 such units to be operational. What is the probability of success for the mission, defined as no failures? The nonparametric estimate for the probability of failure is p = 1 − .95 = .05. Thus, the probability of no failures among 20 units is the same as the probability that all survive, that is,

(1 - p)^n = (0.95)^{20} = 0.359

We realize that there is only about one chance in three, roughly, of having all components operational during the mission. We must go back to the drawing board to improve our reliability! EXERCISE 11.8 In terms of survival probability for the mission duration, how reliable must each of the 20 components be to achieve a mission success level of at least 0.99?

11.3  Confidence Limits for the Binomial Distribution
Suppose we have a population of 50,000 integrated circuits, and we sample 100 units and stress for a particular mode of failure. At the end of the experiment, we observe four failures. We estimate the population percent defective as 4/100 or 4%. However, how precise is that estimate? What is the range of uncertainty? We wish to make a statement of inference about the probable range of percent defective in the population for this failure mode. That is, we will state an interval that has a certain degree of confidence of capturing the population proportion; for example, we are 90% confident that the true percent defective is between, say, 1.4% and 8.9%. In applying confidence limits, we are effectively "tossing a horseshoe" at a population value for the proportion p. For 90% confidence, we expect that 9 out of 10 times we should capture the true value within the limits given. Greater confidence, for example, 95%, with the same sample size requires a larger horseshoe, that is, wider limits. How do we define a confidence interval for a binomial proportion based on sample results? For a single-sided upper confidence limit, we search for a p_u value that would make the probability of the observed results match a desired confidence level. For example, for 95% confidence, we seek a p_u limit such that the probability of observing four or fewer failures for 100 units on test would be 5%. Using the cumulative binomial distribution, we solve for p_u using the formula



\sum_{k=0}^{4} \binom{100}{k} p_u^k (1 - p_u)^{100-k} = 0.05


Prior to the advent of computers and spreadsheet functions, a simple way for a researcher to get confidence intervals for binomial proportions was to refer to the classic Clopper and Pearson (1934) charts for confidence limits for p in binomial sampling. In the previous edition, we provided such charts. However, the availability of spreadsheet functions has made the calculation of exact confidence intervals for binomial proportions an easy task. We make use of the relationship between the binomial and beta CDFs to write the previous equation as



1 - I_{p_u}(r, n - r + 1) = \sum_{k=0}^{4} \binom{100}{k} p_u^k (1 - p_u)^{100-k} = 0.05

Since the upper summation limit is r − 1 = 4, then r = 5 and n − r + 1 = 96, resulting in

I_{p_u}(5, 96) = 0.95

Using tables of the incomplete beta function, we could look up the appropriate p value for a given probability. However, we instead employ spreadsheet capabilities to provide the inverse beta function. The worksheet function is BETAINV, with three arguments: the probability and the two beta distribution parameters, respectively. So, for example, in OpenOffice we see that BETAINV(0.95;5;96) returns the value 0.089. We are 95% confident that the true value of p is not larger than approximately 8.9%. As a check, we note that the expression BINOMDIST(4;100;0.089;1) gives approximately 0.05. Similarly, we can determine a lower limit p_l at some confidence level, say 95%. For a single-sided lower confidence limit, we search for a p_l value that would make the probability of the observed results match a desired confidence level. For example, for 95% confidence, we seek a p_l limit such that the probability of observing four or more failures for 100 units on test would be 5%. Using the cumulative binomial distribution, we solve for p_l in the formula



\sum_{k=4}^{100} \binom{100}{k} p_l^k (1 - p_l)^{100-k} = 0.05

We make use of the relationship between the binomial and beta CDFs to write the above equation as



I_{p_l}(r, n - r + 1) = \sum_{k=4}^{100} \binom{100}{k} p_l^k (1 - p_l)^{100-k} = 0.05

Here r = 4, and n − r + 1 = 97, resulting in

I_{p_l}(4, 97) = 0.05

In Excel, BETAINV(0.05,4,97) returns the value 0.0138. We are 95% confident that the true value of p is larger than the lower limit of 1.38%. As a check, we note that 1 − BINOMDIST(3,100,0.0138,1) gives 0.05. Alternatively, we are 90% confident that the true value lies between 1.38% and approximately 8.9%. To summarize, exact 100(1 − α)% two-sided confidence limits for the binomial parameter p based on x failures out of n units on test may be obtained from the inverse beta function.


The lower limit is 0 for x = 0 and BETAINV(α/2, x, n − x + 1) for x between 1 and n. The upper limit is 1 for x = n and BETAINV(1 − α/2, x + 1, n − x) for x between 1 and n − 1. EXERCISE 11.9 Refer to Exercise 11.4, in which 50 devices are placed on stress for 168 hours. The probability of a device failing by 168 hours is assumed to be .05 or 5%. At the end of the test, there are 10 failures. Provide a 95% confidence interval for the population failure proportion. Does the interval include the assumed 5% failure probability?
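As an illustration only (not part of the original text), the exact limits just described can be evaluated with the inverse beta function in Python's scipy library; the numbers below repeat the 4-failures-out-of-100 example at 90% two-sided confidence.

from scipy.stats import beta

def clopper_pearson(x, n, alpha=0.10):
    """Exact two-sided 100(1 - alpha)% limits for a binomial proportion."""
    lower = 0.0 if x == 0 else beta.ppf(alpha / 2, x, n - x + 1)
    upper = 1.0 if x == n else beta.ppf(1 - alpha / 2, x + 1, n - x)
    return lower, upper

print(clopper_pearson(4, 100))   # roughly (0.014, 0.089)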

11.4  Normal Approximation for Binomial Distribution
If np and n(1 − p) are sufficiently large, say, each term greater than 5, then the fraction of defects X/n in a sample of size n will have a sampling distribution that is approximately normal with mean p and variance p(1 − p)/n. This is the same as saying the statistic

\frac{X/n - p}{\sqrt{p(1 - p)/n}}



has a sampling distribution that is approximately standard normal with mean 0 and variance 1. This expression forms the basis for calculating confidence limits for the binomial distribution for large n based on the normal approximation. The approximation is considered acceptable (see Ross 2002) for values of n satisfying np(1 − p) ≥ 10. If we observe x defects in a large sample of size n, the estimate for the unknown population value p is pˆ = x/n. Approximate 95% confidence intervals will then be given by the expression



pˆ ± z

1−

α 2

pˆ (1 − pˆ ) x = ±z α 1− n n 2

x x 1− n n n

where, for 95% confidence, the standard normal variate is z_{1-\alpha/2} = z_{0.975} = 1.96. This formula also serves as the basis for calculating control limits for statistical process control charting of attribute data (see Section 11.16). However, we caution that arbitrary application of this equation can produce inaccurate results, even for fairly large sample sizes, when the proportion p of interest is very small (low PPM).

EXAMPLE 11.3  SHORTCOMINGS OF THE NORMAL APPROXIMATION
Two rejects are observed out of 80,000 units. Estimate the PPM and provide a 95% confidence interval based on the normal distribution.

Solution The estimated PPM is 2/80,000 = 25 PPM. A 95% confidence interval based on the normal approximation is (−9.7, 59.7) PPM. Negative PPMs do not have physical meaning, and so individuals


often state the interval as (0, 59.7). Even then, the interval is quite inaccurate when compared with the previously described exact estimate 95% confidence interval of (3, 90) PPM. EXERCISE 11.10 Refer to Exercise 11.4, in which 50 devices are placed on stress for 168 hours. At the end of the test, there are 10 failures. Using the normal approximation, determine a 95% confidence interval for the population failure proportion. Compare with the answer in Exercise 11.9. Do you feel the normal approximation is adequate in this case? EXERCISE 11.11 Three rejects are observed in a random sample of 65,000 units. Estimate p in PPM and provide an exact and a normal approximation 90% confidence interval on p.

11.5  Confidence Intervals Based on Binomial Hypothesis Tests
An alternative approach was recommended by Agresti and Coull (1998) and also by Brown, Cai, and DasGupta (2001). This methodology was originally developed by Wilson (1927). The approach is to use the form of the confidence interval that corresponds to the two-sided hypothesis test for the unknown p of the binomial distribution. Specifically, the confidence interval has the form

\frac{\hat{p} + \frac{z_{1-\alpha/2}^2}{2n} \pm z_{1-\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n} + \frac{z_{1-\alpha/2}^2}{4n^2}}}{1 + z_{1-\alpha/2}^2/n}

Although the exact confidence intervals described previously will always have coverage probability of at least 1 − α for every value of p, Agresti and Coull show that these Wilson intervals have coverage probabilities close to the nominal confidence level. One advantage of this procedure is that its effectiveness does not strongly depend upon the value of n and/or p, and indeed it was recommended by Agresti and Coull for virtually all combinations of n and p. Another advantage is that the lower limit cannot be negative. That is not true for the normal confidence expressions that were given in Section 11.4. A confidence limit approach that produces a lower limit that is an impossible value for the parameter for which the interval is constructed is a somewhat questionable approach. This confidence interval method is also known as score confidence intervals and is used in JMP, as shown in the next example. EXAMPLE 11.4  SCORE CONFIDENCE INTERVALS Calculate another 95% confidence interval for the sample data in Example 11.3. This time use the score confidence interval method, either solving the upper and lower limit equations directly or by letting JMP do the calculations.

Solution Figure 11.3 shows both how to set up a JMP worksheet with binomial data and how to input that data into the Distribution Platform query box. Figure 11.4 shows how the JMP Distribution


FIGURE 11.3 Binomial data analysis in JMP.

FIGURE 11.4 JMP binomial confidence interval calculation.

Platform calculates score confidence intervals for this data. Note that the “Pass or Fail” Y entry column has to be either nominal or ordinal type data. The score method interval is (6.9, 91.2) PPM. This is slightly less conservative than the exact method interval for the lower bound and almost exactly the same for the upper bound. EXERCISE 11.12 Repeat Exercise 11.10, replacing the normal approximation method with the score method.


EXERCISE 11.13 Repeat Exercise 11.11, replacing the normal approximation method with the score method.

Finally, we mention that Agresti and Coull suggest that, for simplicity, the normal approximation formula can provide improved confidence interval coverage probabilities for most situations by using an adjusted p value in the confidence interval formula, where \hat{p}_{adj} = (x + 2)/(n + 4). Because of the discreteness of the binomial distribution, the coverage provided by the approximate results can be more useful at times than the conservative exact results.
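For readers who want to check the score (Wilson) interval by hand rather than in JMP, the following Python sketch (not part of the original text) implements the interval formula given earlier in this section; the test values repeat the 2-rejects-in-80,000 data of Examples 11.3 and 11.4.

from math import sqrt

def wilson_interval(x, n, z=1.96):
    """Score (Wilson) confidence interval for a binomial proportion."""
    p_hat = x / n
    center = p_hat + z**2 / (2 * n)
    half_width = z * sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2))
    denom = 1 + z**2 / n
    return (center - half_width) / denom, (center + half_width) / denom

lo, hi = wilson_interval(2, 80000)
print(round(lo * 1e6, 1), round(hi * 1e6, 1))   # about 6.9 and 91.2 PPM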

11.6  Simulating Binomial Random Variables
Simulation of the binomial random variable X (where X represents the number of failures in n independent trials, each trial having failure probability p) can be performed fairly simply. We generate n random numbers U_1, U_2, …, U_n uniformly distributed in the unit interval (0,1). The spreadsheet function RAND() can be used for this purpose. Then, we set X equal to the number of the U_i that are less than or equal to p. The event U_i ≤ p, that is, the event in which the i-th trial generates a uniform random variate that is less than or equal to p, occurs with probability p. Hence, the number of occurrences of U_i ≤ p in n trials results in a binomial (n, p) random variable. However, this method requires the generation of n random numbers to obtain just one value X for the number of occurrences. Is there a more computationally efficient method that we can use? Consider the binomial CDF F(x) = P(X ≤ x) for n = 4, p = 0.5, shown in Table 11.1 and plotted as Figure 11.5. In Chapter 1, we described how to generate any continuous random variable by solving the CDF equation F(x) = u for the inverse function F^{-1}(u) = x and substituting for F in the inverse formula random numbers from a unit uniform distribution. For a discrete distribution like the binomial, the mapping of the CDF into the unit interval is no longer one-to-one. However, we can use a similar approach by viewing the cumulative distribution plot in an inverse manner and graphically finding the x-value corresponding to a range of F(x) values. For example, in Figure 11.5, a CDF value greater than 0.3125 but less than 0.6875 would correspond to x = 2. Fortunately, a spreadsheet function exists to provide this inverse functionality.

TABLE 11.1
Binomial Cumulative Distribution Function: n = 4, p = .5

x    P(X ≤ x)
0    0.0625
1    0.3125
2    0.6875
3    0.9375
4    1


FIGURE 11.5
Binomial CDF, n = 4, p = 0.5: plot of P(X ≤ x) versus x.

The useful function is CRITBINOM(n, p, F), which returns the smallest x-value for which the cumulative binomial distribution is greater than or equal to F. So, for example, for n = 4 trials, with p = .5 and F = 0.3, CRITBINOM(4, 0.5, 0.3) returns the value 1. For simulation, we now replace F with the function RAND(), and then the formula CRITBINOM(n, p, RAND()) generates a binomial random variable X, the number of failures in n trials, each of which has failure probability p. See Winston (2007) for simulation examples using this approach.

EXAMPLE 11.5  SIMULATION OF SYSTEM RELIABILITY
It is known that a component has probability of failure of .01 (1%) after 350,000 hours (approximately 40 years) of field usage. A system consists of six such components, each of which must function for the system to work. What percent of systems are expected to fail in 40 years?

Solution
We know the exact solution for this problem. The system failure probability is 1 − (1 − 0.01)^6 = 0.0585 or 5.85%. To conduct a simulation study, we generated a binomial (n = 6, p = .01) random variable 5000 times in a spreadsheet by copying the formula CRITBINOM(6, 0.01, RAND()) into 5000 rows in a single column. Then, we counted the number of non-zero rows in the column, using the function COUNTIF(column range, "≥1"), because a non-zero row indicates at least one failure in six trials, implying a system failure. Dividing this result by 5000, the simulation outcome was 0.0590 or 5.90%, in close agreement with theory. Although this example is simplistic, more complicated problems—that may be difficult or impossible to solve theoretically—can often be addressed successfully via simulation.

EXERCISE 11.14
A system consists of two boxes. Each box consists of three independent components. All components must work for the box to function. However, the system will work if either


box is operational. If each of the components has a probability of failing by mission end of .025, find via simulation the system reliability at mission end. Compare the simulation result to an exact calculation.
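To mirror the spreadsheet simulation of Example 11.5 outside of Excel or OpenOffice, here is a minimal Python sketch (not part of the original text) that plays the role of CRITBINOM(6, 0.01, RAND()) copied down 5000 rows; the random seed and the use of numpy are arbitrary choices for illustration.

import numpy as np

rng = np.random.default_rng(seed=1)           # arbitrary seed, for repeatability
draws = rng.binomial(n=6, p=0.01, size=5000)  # failures among 6 components, 5000 simulated systems

simulated = np.mean(draws >= 1)   # fraction of systems with at least one component failure
exact = 1 - (1 - 0.01) ** 6       # 0.0585 from the exact calculation

print(round(simulated, 4), round(exact, 4))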

11.7  Geometric Distribution
We described in Section 11.1.3 the conditions (i.e., two outcomes, fixed n, fixed p, independence) for binomial trials. Under those same circumstances, there is another aspect of interest, namely the number of sequential trials to the first occurrence of the event (success or failure). For example, if we toss a fair coin repeatedly, we might like to know the probability of the first head occurring on the first toss, or the second toss, the third, and so on. The answer is given by the geometric distribution. As the first occurrence has to be preceded by x − 1 nonoccurrences, for independent events the probability that the first occurrence happens on trial x is

P(X = x) = p(1 - p)^{x-1}

The expected value is E(X) = \frac{1}{p} and the variance is \mathrm{Var}(X) = \frac{1-p}{p^2} (see Ross 2002).

EXAMPLE 11.6  GEOMETRIC DISTRIBUTION Previous experience for a manufacturing process has shown the probability of a defect to be .10. A new process is supposed to eliminate the defect. On the average, we would expect to see the defect in 1/p = 10 trials. Because of the cost of checking for the defect, we would like to get an early indication before committing to the new process that the defect is eliminated with at least 50% confidence. How many trials are required?

Solution The probability that the defect appears with the first trial is .1. For the second trial, the probability is 0.1(0.9) = .09. Continuing in this manner, we see in Table 11.2 that the cumulative probability for the geometric distribution exceeds the 50% requirement after seven trials.

TABLE 11.2
Cumulative Probability Table for Geometric Distribution

Trials   P(X = x)   P(X ≤ x)
1        0.100      0.100
2        0.090      0.190
3        0.081      0.271
4        0.073      0.344
5        0.066      0.410
6        0.059      0.469
7        0.053      0.522
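The trial-by-trial accumulation in Table 11.2 is easy to reproduce programmatically. The short Python sketch below (not part of the original text) builds the same geometric pmf and CDF for p = .10 and reports the first trial at which the cumulative probability reaches 50%.

p = 0.10
cdf = 0.0
for x in range(1, 31):
    pmf = p * (1 - p) ** (x - 1)   # geometric pmf: first occurrence on trial x
    cdf += pmf
    if cdf >= 0.5:
        print(x, round(cdf, 3))    # 7 0.522, matching Table 11.2
        break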


EXERCISE 11.15 For Example 11.6, create graphs showing the pmf and CDF up to 30 trials.

11.8  Negative Binomial Distribution In contrast to the geometric distribution, in which we are interested in the number of trials to the first occurrence, we may also be interested in the probability associated with the number of trials to the second, third, or more occurrences. The probability the r-th occurrence takes place in the x-th trial is given by the negative binomial distribution

P(X = x) = \binom{x-1}{r-1} p^r (1 - p)^{x-r}

for x = r, r + 1, r + 2, …. The expected value is E(X) = \frac{r}{p} and the variance is \mathrm{Var}(X) = \frac{r(1-p)}{p^2} (see Ross 2002).

EXAMPLE 11.7  NEGATIVE BINOMIAL DISTRIBUTION The probability of a wooden baseball bat cracking upon hitting a pitch is .01. During a typical baseball game, approximately 100 balls are actually struck. What is the probability that the third broken bat occurs at the hundredth pitch?

Solution

P(X = 100) = \binom{99}{2} (0.01)^3 (0.99)^{97} = 0.00183

What might actually be of greater interest is the probability of three cracked bats by 100 or fewer pitches. That probability would be given by the cumulative negative binomial distribution, that is,

P(X \le x) = \sum_{i=r}^{x} \binom{i-1}{r-1} p^r (1 - p)^{i-r}

for x = 100 and r = 3. The solution is .0794.

The spreadsheet function for the negative binomial distribution is NEGBINOMDIST with three arguments: x − r, r, and p, respectively. Thus, referring to Example 11.7, NEGBINOMDIST(97,3,.01) returns the value .00183. Note that if we use x for the first argument instead of x − r, the function provides the probability of x + r trials until the r-th event.

EXERCISE 11.16
For Example 11.7, create graphs showing the pmf and CDF up to 100 trials.
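As a rough cross-check of Example 11.7 (not part of the original text), scipy's negative binomial functions use the same convention as NEGBINOMDIST, taking the number of non-events x − r as the first argument.

from scipy.stats import nbinom

r, p = 3, 0.01   # third broken bat, per-pitch break probability
x = 100          # pitch (trial) of interest

print(round(nbinom.pmf(x - r, r, p), 5))   # about 0.00183, as in Example 11.7
print(round(nbinom.cdf(x - r, r, p), 4))   # about 0.0794: third break within 100 or fewer pitches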


11.9  Hypergeometric Distribution and Fisher’s Exact Test 11.9.1  Hypergeometric Distribution What is the probability of drawing two aces in a row from a deck of well-shuffled cards? In fact, this is an incomplete question. To answer correctly, we need to know how the drawing was performed. Was the first card replaced and the deck reshuffled, thereby assuring a constant probability of 4/52 for each independent draw? Was the first card not replaced and the second draw taken from the remaining deck of 51 cards? In this case, the probability of an ace on the second draw given an ace on the first draw is now 3/51. The events are not independent. When sampling from a finite population without replacement, the hypergeometric distribution can be a useful model. The hypergeometric distribution is described by the following equation, which gives the probability of getting X = x rejects in a sample of size n drawn from a finite lot of size N containing a total of m rejects:



P(X = x) = \frac{\binom{m}{x} \binom{N-m}{n-x}}{\binom{N}{n}},   x = 0, 1, 2, ..., n

Note that the denominator is the number of combinations of objects, taken n at a time, from N. Similarly, the first term in the numerator is the total number of ways of getting x defectives in the sample of size n when m defectives exist in the lot of size N. The second term is the number of ways of getting n − x non-defectives in the remaining group of size N − m. So, the probability of drawing two aces in a row without replacement is



P(X = 2) = \frac{\binom{4}{2} \binom{48}{0}}{\binom{52}{2}} = 0.004525

or one chance in 221. Because of the factorials, the hypergeometric formula is computationally difficult for large numbers. However, in the common situation where N is large relative to n, say, over 10 times greater, the binomial distribution gives an accurate approximation to the hypergeometric distribution. EXAMPLE 11.8  HYPERGEOMETRIC DISTRIBUTION A failure analysis laboratory receives 10 units reported defective in an accelerated stress experiment. The electrical characteristics of the rejects are very similar, and the engineer does not have time to analyze all 10 units. He randomly chooses four parts for the failure autopsy. If there were actually three units with one cause of failure (say, type A) and seven with another reason for failing (say, type B), what is the probability that he fails to see any of the type A failures, that is, none of the four examined units are type A?


Solution Use the hypergeometric distribution with N = 10, n = 4, m = 3, solve for X = 0:

P(X = 0) = \frac{\binom{3}{0} \binom{10-3}{4-0}}{\binom{10}{4}} = \frac{1 \times \frac{7 \times 6 \times 5}{3 \times 2 \times 1}}{\frac{10 \times 9 \times 8 \times 7}{4 \times 3 \times 2 \times 1}} = 0.167

Hence, there is about a 17% chance of not detecting one of the failure causes.

The worksheet function for the hypergeometric distribution is HYPGEOMDIST with arguments x, n, m, and N. The program returns the exact probability for the number of events x. There is no additional argument in this function to obtain the cumulative hypergeometric distribution as was the case for the binomial distribution. EXERCISE 11.17 A sock drawer contains 12 socks, 4 black and 8 red, all mixed together. In the process of randomly selecting two socks in darkness, what is the probability of getting a matching pair of red socks? What is the probability of getting a matched pair of either color?
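For readers working outside a spreadsheet, Example 11.8 can be checked with scipy's hypergeometric distribution; this short Python sketch (not part of the original text) uses scipy's parameter order of population size, number of type A units, and sample size.

from scipy.stats import hypergeom

N, m, n = 10, 3, 4   # lot size, type A failures in the lot, units examined

p_miss = hypergeom.pmf(0, N, m, n)   # probability that no type A unit is among the four examined
print(round(p_miss, 3))              # about 0.167, as in Example 11.8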

11.9.2  Fisher's Exact Test
Consider the situation in which we have few failures even with fairly large sample sizes and we want to test whether two failure proportions are significantly different. For example, in one test, there was one failure out of 1000 items, and in the second test, there were three failures out of 500. Do these results indicate a statistically significant difference, or is this difference likely to occur by chance alone? In most introductory statistics texts, the test for the equality of two proportions is based on large samples and the normal approximation. However, as we have seen earlier in Section 11.4, the normal approximation can result in inaccurate outcomes for small proportions, even with large samples. The normal approximation test of the equality of two proportions is based on the statistic (Walpole et al. 2002)

z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1 - \hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}

where

\hat{p} = \frac{x_1 + x_2}{n_1 + n_2}

For x_1 = 1, n_1 = 1000, x_2 = 3, n_2 = 500, we have \hat{p}_1 = 0.001 and \hat{p}_2 = 0.006. Consequently, \hat{p}_1 - \hat{p}_2 = -0.005, \hat{p} = 0.00267, and z = −1.77. Twice the spreadsheet function NORMSDIST(−1.77) provides the two-sided p value 0.0767. For a significance level of 10%, we reject equality between the two population proportions and conclude that p2 ≠ p1.


Let us now consider Fisher’s exact test. For ease of analysis, the sample outcomes are displayed in a 2 × 2 contingency table (Table 11.3) along with the marginal totals (i.e., row and column sums). Fisher’s exact test, based on the hypergeometric distribution, calculates the probability of observing differences equal to or greater than the results observed. For a one-sided test, the interest is whether sufficient information exists to conclude either that p2 > p1 or p2 < p1 instead of the equality p1 = p2. Here, the total sample size is N = 1500, the number in Group 1 is n = 1000, and the total failure count is m = 4. The probability of getting exactly x = 1 failure in Group 1 for a sample of size n = 1000 drawn from N = 1500 with m = 4 failures available is given by the hypergeometric distribution as



P(X = 1) = \frac{\binom{4}{1} \binom{1496}{999}}{\binom{1500}{1000}} = \frac{4 \times 1000 \times 500 \times 499 \times 498}{1500 \times 1499 \times 1498 \times 1497} = 0.09857

Alternatively, the worksheet function HYPGEOMDIST(1,1000,4,1500) in Excel gives the same value. However, under equal proportions, we need to consider the probabilities of even more extreme differences with the same marginal totals. A bigger difference with the same marginal totals would be zero failures in 1000 in the first group and four failures in 500 in the second group, for \hat{p}_1 - \hat{p}_2 = -0.008. Table 11.4 shows this case.



P(X = 0) = \frac{\binom{4}{0} \binom{1496}{1000}}{\binom{1500}{1000}} = \frac{500 \times 499 \times 498 \times 497}{1500 \times 1499 \times 1498 \times 1497} = 0.01225

Alternatively, the worksheet function HYPGEOMDIST(0;1000;4;1500) in OpenOffice provides identical results.

TABLE 11.3
Contingency Table 1 for Fisher's Exact Test

          Group 1   Group 2   Totals
Pass        999       497      1496
Fail          1         3         4
Totals     1000       500      1500

TABLE 11.4
Contingency Table 2 for Fisher's Exact Test

          Group 1   Group 2   Totals
Pass       1000       496      1496
Fail          0         4         4
Totals     1000       500      1500

So, the one-sided probability of observing the actual results or


stronger differences under p1 = p2 is the sum 0.09857 + 0.01225 = 0.1108. Fisher's exact test shows that these results could occur by chance alone in roughly one out of nine times; that is, we cannot reject the possible equality of the population proportions, even at a 10% significance level. The normal approximation gave misleading results compared with the more accurate Fisher's exact test. For the same marginal totals, we note there are other possibilities for the failure occurrences among the two groups: 2/1000 versus 2/500, 3/1000 versus 1/500, and 4/1000 versus 0/500, with differences \hat{p}_1 - \hat{p}_2 equal to −0.002, 0.001, and 0.004, respectively. None of these differences is equal to or more extreme than the cases we evaluated. As the two-tailed probability is the sum of the probabilities of all tables whose probabilities are less than or equal to the probability of the observed table, here the two-sided test gives the same p value as the one-sided test.

EXERCISE 11.18
In an experiment to compare different treatments, the old method produced 4 rejects out of 20. In the second experiment, the new procedure resulted in 0 rejects out of 15. Using both the normal approximation and Fisher's exact test, determine how statistically significant the improvement is between the new and the old methods.

EXERCISE 11.19
Use the hypergeometric distribution to calculate the probabilities for all possible failure occurrence differences, assuming equal marginal totals, for the experiment described in the previous exercise.
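As an optional cross-check of the hand calculation above (not part of the original text), scipy provides a Fisher's exact test for 2 × 2 tables; the table below is the pass/fail layout of Table 11.3.

from scipy.stats import fisher_exact

table = [[999, 497],   # passes in Group 1, Group 2
         [1, 3]]       # failures in Group 1, Group 2

odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
print(round(p_value, 4))   # about 0.1108, matching the text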

11.9.3  Fisher's Exact Test in JMP and Minitab
In JMP, we set up the data table (Table 11.5) as shown.

TABLE 11.5
JMP Data Table

     Category   Results   Count
1    Group 1    Pass        999
2    Group 1    Fail          1
3    Group 2    Pass        497
4    Group 2    Fail          3

In the main menu, we select Analyze and Fit Y by X. In the dialog box, for Y, Response, select Results. For X, Factor, choose Category, and for Freq, choose Count. Hit OK to get an output with a Mosaic Plot, Contingency Table, and Tests. Under Tests, Fisher's exact test results are shown as Table 11.6.

TABLE 11.6
Fisher's Exact Test Results

Fisher's Exact Test   Prob     Alternative Hypothesis
Left                  0.1108   Prob(Results = Pass) is greater for category = Group 1 than Group 2
Right                 0.9878   Prob(Results = Pass) is greater for category = Group 2 than Group 1
2-Tail                0.1108   Prob(Results = Pass) is different across category

Note that JMP also provides the significance level for the one-sided test p2 < p1. In that situation, the probability is obtained by considering the cases for Group 2 passing of 497/500,


498/500, 499/500, and 500/500. Obviously, that probability will be 1 − the probability of 496/500, that is, 1 − 0.01225 = 0.9878. Thus, there is little evidence to infer that the failure proportion p2 < p1.

In Minitab, we select the menu items Stat – Basic Statistics – 2 Proportions… In the dialog box, select Summarized Data and enter the information as shown in Figure 11.6. Under Options…, choose Use pooled estimate of p for test, the desired Confidence level, and the Alternative. The example output for a two-sided test is illustrated in Figure 11.7. Note that Minitab issues a note about the normal approximation for small samples, while providing the results, along with the p value for Fisher's exact test.

FIGURE 11.6
Input for Fisher's exact test in Minitab.

FIGURE 11.7
Output for Fisher's exact test in Minitab:

Test and CI for Two Proportions
Sample   X   N      Sample p
1        1   1000   0.001000
2        3   500    0.006000
Difference = p (1) − p (2)
Estimate for difference: −0.005
95% confidence interval for difference: (−0.0120469, 0.00204688)
Test for difference = 0 (vs not = 0): Z = −1.77  P-Value = 0.077
* NOTE * The normal approximation may be inaccurate for small samples.
Fisher's exact test: P-Value = 0.111

EXERCISE 11.20
Repeat Exercise 11.18 using JMP or Minitab.


11.10  Poisson Distribution The Poisson is an important discrete distribution that models counts of events where the probability of a single event over a period of time or space is constant. For example, for calculations involving density, such as the number of defects per wafer, the Poisson distribution is frequently used to model the defect distribution. The Poisson has been applied to the number of phone calls arriving in a given period, the number of repairs in a given time period, the number of bugs in software, the number of raisins in a box of cereal, the number of people in a queue, the number of flaws per yard of insulated wire, the number of misprints per page, and so on. The equation for the Poisson distribution is given as follows:

P(X = x) = \frac{\lambda^x e^{-\lambda}}{x!},   x = 0, 1, 2, ...

where λ is the average density or expected value, for example, the mean number of defects for a given area. The variance is also λ. The standard deviation is \sqrt{\lambda}. In Excel and OpenOffice, the worksheet function is POISSON(x, λ, 0 or 1), where the third argument is the logical 0 or 1 for pmf or cumulative distribution output, respectively. This simple formula can also be used numerically to approximate the binomial distribution, even though the methods of deriving the distributions are different. The approximation will be accurate when n is large and the probability p that a unit is defective is small, such that the product np is roughly less than 7. To use the Poisson as an approximation to the binomial, we use the average density, λ, set equal to the product np.

EXAMPLE 11.9  POISSON DISTRIBUTION
A sample of 250 units is periodically drawn and tested. If the process producing these units generates a defect level of 1%, what is the probability of getting less than two failures in the sample tested?

Solution
We recognize applicability here of the binomial distribution. However, because of the large sample size and small p, we instead use the Poisson distribution with λ = 250 × 0.01 = 2.5. The probability of less than two failures is the probability of zero and one failure. So we calculate

P(X = 0) = \frac{\lambda^x e^{-\lambda}}{x!} = e^{-2.5} = 0.082

and

P(X = 1) = \frac{2.5^1 e^{-2.5}}{1!} = 0.205

The total probability is thus .287, or there is roughly only a 1 in 3.5 (i.e., the reciprocal of .287) chance of getting less than two failures. EXERCISE 11.21 Calculate and compare the results using the binomial distribution to the answer given for the Poisson approximation in Example 11.9.
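The Poisson figures in Example 11.9 are also easy to confirm with scipy; this brief Python sketch (not part of the original text) evaluates the same pmf and cumulative probability at λ = 2.5.

from scipy.stats import poisson

lam = 2.5   # expected failures: 250 units times a 0.01 defect level
print(round(poisson.pmf(0, lam), 3))   # about 0.082
print(round(poisson.pmf(1, lam), 3))   # about 0.205
print(round(poisson.cdf(1, lam), 3))   # about 0.287, P(less than two failures)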


Unlike the binomial distribution, where the number of events cannot exceed the ­sample size, the Poisson distribution has no limit on the number of defects per unit of measure that can be counted. Hence, we distinguish between a defect and a defective unit. A unit is defective if it has one or more defects; a defect is defined as a nonconformance to specifications. Thus, the analysis of defective units is handled differently compared with the analysis of the number of defects per unit or per area. The former case can be handled by the binomial (or hypergeometric) distribution and the latter by the Poisson. EXERCISE 11.22 The average defect density is 0.01 defects per square inch. The wafer area is 200 square inches. Ten wafers are randomly selected from a lot of 50. Determine: (1) the expected number of defects per wafer; (2) the expected number of defects for the 10 wafers; (3) the probability any wafer is defect free; (4) the probability any wafer has at least one defect; (5) the probability any wafer has exactly five defects; (6) the probability all 10 wafers are defect free; (7) the probability exactly two of the 10 wafers are defect free.

The Poisson distribution also has a useful relationship to the exponential life distribution as follows: Assume, for example, that the time between emergency repair requests coming into a central office follows an exponential life distribution with parameter λ. Then, the probability of exactly k emergency repair calls in the time interval t is given by the Poisson distribution with parameter λt. In other words,

P(k \text{ repair calls}) = \frac{(\lambda t)^k e^{-\lambda t}}{k!}

As a confirmation of this relationship, consider that the probability of no calls in the time interval t is the probability for k = 0, or

P(\text{no repair calls}) = e^{-\lambda t}

As we showed in Chapter 1, the complementary event to zero occurrences is at least one occurrence and this probability is given by

P(\text{at least one repair call}) = 1 - e^{-\lambda t}

This last expression is just the CDF F(t) for the exponential distribution with parameter λ. So a discrete process (e.g., the number of repair calls) described by a Poisson distribution with parameter λ t has a probability of an event occurring in the time interval t described by the continuous exponential distribution with parameter λ . EXERCISE 11.23 High energy rays strike a measuring device. Occasionally, a ray of sufficient intensity causes the device to cease functioning, and a reset of the instrumentation is required. Each reset involves replacement of a seal. If the time between the arrival of the disabling rays follows an exponential distribution with MTTF = 2190 hours, what is the probability that five or more seals will be needed in inventory for repairs during the year?

The population parameter λ for a Poisson distribution is estimated by the average observed number of occurrences over the collection of time periods, samples, objects, areas, lines, and so on. Confidence limits for the parameter λ of a Poisson distribution can be obtained using percentiles of the chi-square distribution with appropriate degrees of freedom. For x observed occurrences, the lower and upper limits for 100(1 − α)% confidence are given by \frac{1}{2}\chi^2_{[2x,\,\alpha/2]} and \frac{1}{2}\chi^2_{[2(x+1),\,1-\alpha/2]}, respectively. Note the confidence limits depend only on the


confidence level and the observed count. In Excel and OpenOffice, the worksheet function is CHIINV(probability, degrees of freedom). For 100(1 − α)% confidence, the lower limit in Excel is 0.5 × CHIINV(1 − α/2,2x) and the upper limit is 0.5 × CHIINV(α/2,2(x + 1)). EXAMPLE 11.10  CONFIDENCE LIMITS FOR EXPECTED VALUE OF A POISSON DISTRIBUTION The manufacturer wishes to assess the level of marking defects on integrated circuit packaging. He randomly samples 100 pieces and finds 9 units with one defect, 2 with two defects, and 1 with three defects. Find the average defect density per package and provide 90% confidence limits.

Solution
The total number of defects observed is 9 + 2(2) + 1(3) = 14. Thus, \hat{\lambda} = 0.14. The 90% confidence limits are obtained from the worksheet function. In OpenOffice, 0.5 × CHIINV(0.95;28) returns the value 8.464. The upper limit 0.5 × CHIINV(0.05;30) gives the value 21.89. Thus, the 90% confidence interval on λ is (0.08464, 0.2189).

EXERCISE 11.24
Based on the results in Example 11.10, estimate the probability of 0, 1, 2, 3 defects using the Poisson distribution. How many units in 100 are expected to have 0, 1, 2, 3 or more defects? Compare with the data. Suppose a second sample of 100 is taken and 16 total defects are observed. Provide a 99% confidence interval for the expected number of defects based on the combined samples totaling 200 units.

In estimating confidence limits, one must clearly distinguish between Poisson confidence limits on the total count expected during a given interval and confidence intervals for the average count per unit of measure. EXAMPLE 11.11  POISSON CONFIDENCE LIMITS Suppose we open three boxes of cereal and find a total of nine raisins. The average estimate for λ is three raisins per box. The 90% confidence interval on the total count for any new three boxes is 4.70–15.71, and on the average λ, the interval is 1.57–5.24. Note that the width of the confidence interval is 3.67. Suppose we open 10 boxes in all and find a total of 30 raisins, and the average number of raisins per box is three, as before. The 90% confidence interval on the total count for any new 10 boxes is now 21.59–40.69, but the 90% confidence interval on the average λ per box is 2.159–4.069. Note that the confidence interval width for λ is now 1.91, because we have a larger sample. The width of the confidence interval on λ depends on the sample size. EXERCISE 11.25 1000 randomly selected lines of code are inspected. 23 bugs are found. Assuming a Poisson distribution: (a) What is the estimate of the expected number of bugs per 1000 lines of code (KLOC) in the population? (b) Provide a 90% confidence interval on the expected number of bugs per KLOC in the population.
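The chi-square based limits used in Examples 11.10 and 11.11 can be scripted as well. The Python sketch below (not part of the original text) mirrors the CHIINV expressions given earlier (note that scipy's ppf uses left-tail probabilities, whereas CHIINV uses right-tail probabilities) and reproduces the 90% limits for 9 raisins found in three boxes.

from scipy.stats import chi2

def poisson_limits(x, alpha=0.10):
    """100(1 - alpha)% confidence limits on the expected total count."""
    lower = 0.5 * chi2.ppf(alpha / 2, 2 * x)
    upper = 0.5 * chi2.ppf(1 - alpha / 2, 2 * (x + 1))
    return lower, upper

lo, hi = poisson_limits(9)
print(round(lo, 2), round(hi, 2))          # about 4.70 and 15.71 (total count for three boxes)
print(round(lo / 3, 2), round(hi / 3, 2))  # about 1.57 and 5.24 (average per box)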

11.11  Types of Sampling Consider a lot consisting of a very large number of packaged integrated circuits. Suppose we are concerned about a particular mode of failure. Assume that the entire lot (the population) would experience failures at some level, perhaps 1%, if the entire lot were stressed


a certain length of time, say 100 hours, under normal operating conditions (voltage, temperature, and humidity). However, we wish to test only a small sample under similar operating conditions for the specified time to estimate the population percent failing and infer if the lot has acceptable reliability. How do we choose an appropriate sample size? And how do we make reasonable decisions about the lot reliability based on the sample results? Assume we make a decision based on the results of looking at a one sample of size n. This type of analysis is called “single-sampling.” It is also possible to perform double sampling wherein the results of the first sample drawn from a lot generate one of three possible outcomes: (1) accept the lot; (2) reject the lot; or (3) take an additional sample and combine the results to reach a final decision. In fact, multiple sampling plans are possible in which more than two draws from the population are performed before a decision is reached. For further discussion of multiple sampling plans, the reader should consult Burr (1976) or Schilling (1982). Also, MIL STD 105E (1989), replaced by American National Standard ANSI/ASQC Z1.4-2003, contains actual single, double, and multiple sampling schemes for inspection by attributes. In this chapter, only single-sample plans will be discussed. Since one either accepts or rejects a lot or process in single-sampling plans, the probability of acceptance plus the probability of rejection is equal to one. Hence, rejection and acceptance probabilities will be used interchangeably. 11.11.1  Risks To understand fully the sampling process, one must comprehend the risks involved with the decision to either accept or reject a lot. A straightforward approach might be to establish an acceptable level or a maximum level of allowed percent defective, say Y%. Then, we could choose a convenient sample size for stressing, say 50 units, and set the criteria based on the maximum number of failures in the sample on stress such that we have specified probabilities of accepting or rejecting the lot. If we have more failures than the acceptance number (call it c), we reject the lot; if we have c or fewer failures, we accept the lot. The acceptance number recognizes the fact that a large lot with a fixed fraction defective produces varying results for each sample drawn, even if the sample size is the same. However, we can calculate an acceptance number such that a lot with a fraction defective greater than Y is likely to have a number of rejects in the sample that exceeds the acceptance number. Conversely, if the lot fraction defective is not excessive, we can calculate how likely it is to see more rejects than c in a given sample. We can also calculate how likely we are to not exceed c for a given excessive lot defect level. Let us consider these risks of making correct or wrong decisions in more detail. There exists a matrix of possible correct and incorrect decisions, shown as Table 11.7. If the lot is truly less than Y% defective, and we make a decision based on the sample results to accept the entire lot, we have made a correct decision. If the lot is greater than Y% TABLE 11.7 Matrix of Possible Choices Population Value (% Defective) ≤Y% Decision on lot

Accept Reject

Correct Type I error

>Y% Type II error Correct

Quality Control in Reliability: Applications of Discrete Distributions

395

defective, and the sample results cause us to reject the lot, we again have made the correct decision. If, however, the lot is less than Y% defective and the sample outcome leads to lot rejection, then we would have committed an error (called a Type I error). The probability of this wrong decision is referred to as an “α” (alpha) risk. Similarly, a Type II error, at a “β” (beta) risk level, occurs when sample results cause lot acceptance for a lot that is actually greater than Y% defective. The terms “producer’s risk” and “consumer’s risk” are often applied to the alpha and beta risks, respectively, because Type I error refers to the rejection of a good lot and Type II error designates the acceptance of a bad lot. Obviously it is costly to a producer to throw away or rescreen acceptable material due to Type I error. Similarly, the cost to the user of taking product with a high reject rate, because of a Type II error, may be considerable if, for example, he has to replace or repair defective systems caused by bad components. We now consider the calculation of the probabilities of these possible events. 11.11.2  Operating Characteristic Curve If we are sampling from a finite lot, the distribution of the possible outcomes is described by the hypergeometric distribution model. However, when the sample size selected is small (roughly 10% or less) relative to the lot size, the binomial distribution is useful as an excellent approximation to the hypergeometric distribution. Also, the binomial distribution would be applicable for sampling from a continuous, stable process. We assume the validity of the binomial distribution for the following discussion. Our first step is to calculate the probability of getting an acceptable lot based on c or less failures out of the n (say 50) units on stress. Assume that the lot percent defective is actually 2%. We further desire that the acceptance number c be such that at least 95% of the time we accept the lot if the true percent defective is 2% or less; that is, our risk of Type I error is 5%. We can calculate the cumulative binomial probabilities for 1, 2, 3, and so on, up to consecutive trial values of c failures, for n = 50, p = 0.02, until the probability of getting a particular c or less failures exceeds 95% or 0.95. Then, at least 95% of the time, we will accept a lot with 2% defective based on that choice of c for an acceptance number. 11.11.3  Binomial Calculations The formula for the cumulative binomial probability is c



P(X ≤ c) =

∑ x= 0

n x p (1 − p)n− x x

So we require c



∑ x= 0

50 (0.02)x (0.98)50− x ≥ 0.95 x

We can easily set up a spreadsheet with three columns (i.e., x, binomial probability, and cumulative binomial probability) as shown in Table 11.8 to evaluate the binomial formula through the function BINOMDIST(x, n, p, 0 or 1). At x = 3, we stop because the cumulative binomial probability, that is, P(X ʺ 3), is greater than .95. However, a much simpler procedure is to apply the spreadsheet function


TABLE 11.8 Binomial Probability Calculations for Sample of Size n = 50 and p = 0.02 A



B



C

Binomial Probabilities 1 2 3 4 5 6 7 8 9

n p

50 0.02

x 0 1 2 3

P(X = x) 0.3642 0.3716 0.1858 0.0607

P(X ≤ x) 0.3642 0.7358 0.9216 0.9822

TABLE 11.9
Probability of Three or Less Failures in Sample of Size n = 50 for Various Lot Percent Defective Values

Lot Percent Defective p   Probability P(X ≤ 3)
0.01                      0.9984
0.02                      0.9822
0.03                      0.9372
0.05                      0.7604
0.07                      0.5327
0.10                      0.2503
0.13                      0.0958
0.15                      0.0461

CRITBINOM(n, p, 1 − α), which provides the smallest value c for which the cumulative binomial distribution meets or exceeds (1 − α). Thus CRITBINOM(50, 0.02, 0.95) returns the value 3. By either method, the desired c number is 3, that is, we accept a lot if 3 or less units fail out of the 50 sampled; we reject the lot if 4 or more fail. Our risk of committing a Type I error will then be less than 5%. Now what about the risk of a Type II error, that is, accepting a bad lot defined as over 2% defective? To calculate the risks associated with the Type II error, it is not sufficient to state only one defect level; one needs to specify the various alternative percent defective values and perform each calculation separately. For example, one may be interested in the probability of getting 3 or less failures out of 50 units sampled (and thereby accepting the lot) if the true population value is 3%, 5%, 7%, 10%, or 15%, and so on. We have performed such calculations and the numbers are shown as Table 11.9. 11.11.4  Examples of Operating Characteristic Curves This table may be better represented in the form of a graph, called the OC curve, which details the probability of accepting a lot based on c allowed rejects in a sample of size n for various lot percent defective values. Such a curve is shown, for n = 50, c = 3, in Figure 11.8. The OC curve provides the total picture of the sampling plan. We note, for example, that if a lot having 7% defective units is presented for sampling, there is about a 50–50 chance


FIGURE 11.8 Operating characteristic curve for n = 50, c = 3 (probability of acceptance versus lot percent defective).

FIGURE 11.9 Operating characteristic curves for different acceptance numbers (n = 50; c = 0, 1, 2, 3).

The OC curve provides the total picture of the sampling plan. We note, for example, that if a lot having 7% defective units is presented for sampling, there is about a 50–50 chance of its being accepted; we call this percent defective the "point of indifference." If all lots come in with exactly that percent defective, you might as well flip a coin and accept lots when you get "heads"—the results will be equivalent, in the long run, to sampling 50 units with a c number of 3. Also, we have about a 10% chance of accepting a lot offered at around 12.5% defective, or over six times the 2% value we targeted in setting up this sampling plan. Stated another way, the consumer's risk is 10% that a lot at 12.5% will have three or less failures in a sample of 50 and thus be accepted. How can the Type II or consumer's risk be reduced? Obviously, one can decrease the acceptance number to c = 2, 1, or even 0, but what would that change mean to the OC curve? Figure 11.9 shows the OC curve for various acceptance numbers with the sample


size fixed at 50. We see that the probability of acceptance has been reduced at all incoming percent defective levels. At the target value of 2%, the probability of acceptance goes down from 0.98 to 0.92 to 0.74 to 0.36 as we change c from 3 to 2 to 1 to 0, respectively. Thus, we have greatly increased the probability of rejecting an acceptable lot, that is, the producer's risk, by lowering the acceptance number while keeping the sample size fixed. Similarly, if only the sample size increases and the acceptance number c is held constant, then again the consumer's risk can be reduced but at the expense of the producer's risk. Figure 11.10 shows the OC curve for samples of size n = 50, 100, 300, and 500 and acceptance number c = 3. The effect of increasing the sample size while the acceptance number remains fixed is to significantly reduce the probability of lot acceptance at all percent defective values. Thus, the consumer's risk is driven dramatically lower, but the producer's risk becomes very high. For example, the probability of accepting a lot at 2% defective, allowing three rejects in a sample of size n = 50, is about 98%; hence, the Type I risk is only 2%. However, sample sizes of 100, 300, and 500 will have probabilities of acceptance of 86%, 15%, and 1%, respectively, for the same acceptance number c = 3. So the producer's risk is correspondingly 14%, 85%, and 99%. However, for the consumer's risk held at 10%, the corresponding percent defective values are approximately 13%, 6.5%, 2.2%, and 1.1% for samples of size 50, 100, 300, and 500, respectively. In fact, the only way to decrease both Type I and II errors simultaneously is to increase the sample size drawn while also adjusting the acceptance number. What we would like to do, indeed, is to specify a high probability of acceptance, say 95%, at a desirable or acceptable quality level and simultaneously require a high probability of rejection, say 90%, at an undesirable or rejectable quality level. For individual lots, it is common to call the percent defective value at the 95% acceptance probability the "AQL," for acceptable quality level. If a lot is rejected, we can loosely state with 95% confidence that the lot defect level is above the AQL. Similarly, the 10% acceptance probability is referred to as the "LTPD," or lot tolerance percent defective or even sometimes as the "RQL," for rejectable quality level. If a lot

FIGURE 11.10 Operating characteristic curves for different sample sizes (c = 3; n = 50, 100, 300, 500).


is accepted, we can also loosely state with 90% confidence that the lot defect level is below the LTPD. Note, however, that the term AQL is often applied in the literature in the context of a specified quality level without being tied to a given probability point, such as 95%. Furthermore, military documents, such as MIL STD 105E (1989), refer to an AQL value assured in terms of an overall sampling scheme with different acceptance probabilities for a given AQL, depending on lot sizes and other factors. In contrast to the incoming lot defect level, we are often interested in the outgoing quality based on a specific sampling plan. We will make an assumption that a rejected lot is screened through 100% inspection and all defects are removed. Consequently, the defect level in the rejected lot is made to be zero. The lots that pass will have the incoming defect level. As a result, the actual defect level of the product shipped will be a mixture of lots at a certain defect level and perfect lots from rescreening. Let P_A be the probability of accepting a lot at incoming defect level p. Then, the average outgoing quality AOQ is given by the theorem of total probability (see Chapter 1) as

$$AOQ = P_A\,p + (1 - P_A)(0) = P_A\,p$$



Basically, we take any point on an OC curve and multiply the y-value (P_A) by the x-value (p) to generate an AOQ curve. Such a curve is shown in Figure 11.11 for the sampling plan (n = 50, c = 2). Note the AOQ curve peaks. For low defect levels of incoming quality, the outgoing defect level is also low. For high incoming defect levels, more lots are rejected and screened, removing the defects, and thus, the outgoing quality defect level is also low. Where the curve peaks is called the AOQL, the average outgoing quality limit. The AOQL represents the worst possible average outgoing quality from this sampling plan. Here, the AOQL is approximately 2.75% and this value would only occur if all the incoming lots had defect levels of about 4.5%. Also, there are other schemes that are based on a specific characteristic, such as the LTPD for a single lot or the AOQL for a series of lots when it is feasible to screen out defects.
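The AOQ calculation is easy to automate. The following minimal sketch (again assuming Python with scipy.stats rather than the spreadsheet used in the text) multiplies each OC-curve value by its incoming defect level and locates the peak; for the (n = 50, c = 2) plan it lands near the AOQL of roughly 2.75% at about 4.5% incoming defective quoted above.

```python
# Sketch: AOQ = p * P(acceptance at p); the maximum over p is the AOQL.
import numpy as np
from scipy.stats import binom

n, c = 50, 2
p = np.arange(0.001, 0.20, 0.001)      # grid of incoming lot fraction defectives
aoq = p * binom.cdf(c, n, p)           # accepted lots ship at p, rejected lots at 0

i = np.argmax(aoq)
print(f"AOQL ~ {aoq[i]:.4f} at incoming defect level ~ {p[i]:.3f}")
# expect roughly 0.0275 at about 0.045
```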

FIGURE 11.11 AOQ curve with AOQL (average outgoing quality versus lot percent defective).


The  book by Dodge and Romig (1959) provides extensive tables for both the LTPD and AOQL procedures. Full information on the subject of acceptance sampling can be found in the book by Schilling (1982). Section 11.12 presents several ways of generating sampling plans to provide specific risk protection at given acceptable and rejectable defect levels. EXERCISE 11.26 Generate the OC curve for n = 100 and c = 2. EXERCISE 11.27 Generate the family of OC curves for n = 50, 100, 300, and 500 for c = 2. EXERCISE 11.28 Generate the family of OC curves for c = 0, 1, 2, 3 and n = 100. EXERCISE 11.29 Generate AOQ curves for c = 0, 1, 2, 3 and n = 100. Find the AOQLs for each sampling plan.

11.12  Generating a Sampling Plan A sampling plan is uniquely determined by the lot size, N, the sample size, n, and the acceptance number, c. For a sufficiently large lot or for sampling from a process, the N parameter can be ignored. We shall do so here. Thus, n and c uniquely specify the sampling plan. An easy way to obtain n and c for a given set of percent defectives and matching α, β risks is to use a computer program that does the calculations. Several software packages for acceptance sampling are available for personal computers. The format is simple: one typically provides the acceptable and rejectable quality levels (AQL = p1, and RQL = p2) and the associated α, β risks, and the program then generates the sampling plan, that is, the sample size and acceptance number. Also, the program may provide the OC curve and other associated plots. However, it is fairly straightforward using spreadsheet functions to generate a sampling plan. First, we need to describe the approach. Guenther (1974) presents a very simple iterative procedure, using the percentiles $\chi^2_{df;\,\mathrm{area}}$ of the chi-square distribution. The equation for determining the minimum sample size n and the accompanying acceptance number c is

$$\tfrac{1}{2}\left[\chi^2_{2c+2;\,1-\beta}\left(\frac{1}{p_2}-0.5\right)+c\right] \;\le\; n \;\le\; \tfrac{1}{2}\left[\chi^2_{2c+2;\,\alpha}\left(\frac{1}{p_1}-0.5\right)+c\right]$$

For a given set of conditions (p1, p2, α, β), one simply tries a c-value, using the chi-square percentiles for 2c + 2 degrees of freedom and the respective probabilities, and determines if the inequalities for the sample size n are mathematically satisfied. If so, then the minimum n is chosen. If not, then the c-value is incremented by 1, until the interval for n contains at least one integer. In Excel or OpenOffice, the two-argument spreadsheet


function CHIINV(probability, degrees_of_freedom) can provide the desired chi-squared percentiles. To evaluate $\chi^2_{2c+2;\,\alpha}$, we use CHIINV(1 − α, 2c + 2). For $\chi^2_{2c+2;\,(1-\beta)}$, the function is CHIINV(β, 2c + 2). Thus, for an alpha risk of 5% with 8 degrees of freedom, CHIINV(0.95,8) returns the value 2.733. For a beta risk of 10%, CHIINV(0.10,8) gives the result 13.362. For example, let us find a sampling plan with acceptance probability of .95 for product at 2% defective and rejection probability of .90 for product at 8% defective. If we try c = 0, 1, 2, or 3 in the above formula, we find that no solutions exist, that is, the inequality does not hold. To illustrate for c = 3, we get

$$\tfrac{1}{2}\left[\chi^2_{8;\,0.90}\left(\frac{1}{0.08}-0.5\right)+3\right] \le n \le \tfrac{1}{2}\left[\chi^2_{8;\,0.05}\left(\frac{1}{0.02}-0.5\right)+3\right]$$

$$0.5 \times [13.36(12)+3] \le n \le 0.5 \times [2.73(49.50)+3]$$

$$81.66 \le n \le 69.07$$

and hence no solution exists. However, for c = 4, we discover

$$\tfrac{1}{2}\left[\chi^2_{10;\,0.90}(12)+4\right] \le n \le \tfrac{1}{2}\left[\chi^2_{10;\,0.05}(49.50)+4\right]$$

$$0.5 \times [15.99(12)+4] \le n \le 0.5 \times [3.94(49.50)+4]$$

$$97.94 \le n \le 99.52$$

and the minimum sample size is 98. The sampling plan is n = 98 and c = 4. Figure 11.12 shows a spreadsheet set up to solve the previous problem. As a check on the solution, we can use the function BINOMDIST with the arguments c, n, p1, and 1 for the acceptable quality level p1 = 0.02. Thus, BINOMDIST(4,98,0.02,1) yields

FIGURE 11.12 Spreadsheet set-up for determining acceptance sampling plan.


95.3%. Similarly, for the rejectable quality level p2 = 0.08, we use BINOMDIST(4,98,0.08,1) to get 9.9%. EXERCISE 11.30 Apply the Guenther method using spreadsheet functions to determine the sample size and acceptance number for an acceptable quality level of 0.01 with an alpha risk of 0.025 and a rejectable quality level of 0.05 with beta risk of 0.10. EXERCISE 11.31 Find a sampling plan to achieve AQL = 0.01, RQL = 0.03, α = 0.01, β = 0.10.
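The Guenther iteration is also straightforward to program. The sketch below is an assumed Python implementation (scipy.stats.chi2.ppf standing in for CHIINV, not the authors' spreadsheet setup); it reproduces the plan n = 98, c = 4 for the example above and can be reused for Exercises 11.30 and 11.31.

```python
# Sketch of the Guenther procedure: increment c until the n-interval
# contains an integer, then take the minimum n.  chi2.ppf(q, df) is the
# lower-tail percentile, so chi2.ppf(alpha, df) plays the role of
# CHIINV(1 - alpha, df) in the text.
from math import ceil, floor
from scipy.stats import chi2, binom

def guenther_plan(p1, p2, alpha, beta, cmax=100):
    for c in range(cmax + 1):
        df = 2 * c + 2
        low = 0.5 * (chi2.ppf(1 - beta, df) * (1 / p2 - 0.5) + c)
        high = 0.5 * (chi2.ppf(alpha, df) * (1 / p1 - 0.5) + c)
        if ceil(low) <= floor(high):        # interval contains an integer
            return ceil(low), c
    return None

n, c = guenther_plan(p1=0.02, p2=0.08, alpha=0.05, beta=0.10)
print(n, c)                                           # expect n = 98, c = 4
print(binom.cdf(c, n, 0.02), binom.cdf(c, n, 0.08))   # checks: ~0.953 and ~0.099
```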

11.12.1  LTPD Sampling Plans There is a special category of sampling plans known as LTPD sampling plans. These plans are based on a fixed consumer's risk of 10%, which assures that a lot having a specified RQL will be rejected with 90% confidence. The RQL at 10% probability of acceptance is commonly called the LTPD, for lot tolerance percent defective. LTPD plans are used for many product qualification plans to assure consumer protection. Because of constraints on sample size due to limited time, money, or other resources, there is often the need in industry to reduce sample sizes and the corresponding acceptance number while holding the LTPD constant. For example, a qualification plan calls for stressing 300 devices for 1000 hours under specified conditions. The requirement for passing is a maximum of three failures. However, early in the manufacturing development cycle the product is very expensive to make, and consequently, the engineers want to reduce the sample size and acceptance number for qualification while maintaining the consumer's risk and rejectable quality level. It is for such an objective that LTPD plans are useful. In manufacturing areas, one may see quality control workers use a specially prepared table such as the one from MIL-M-38510J (1991) downloadable from http://www.everyspec.com/MIL-SPECS/MIL+SPECS+(MIL-M)/MIL-M-38510J_14597/, shown here as Table 11.10, for LTPD sampling plans. The use of this table is quite simple. The LTPD is listed across the top row as a percentage. The acceptance number c is the left-most column. The intersection of a specified LTPD and acceptance number is the minimum sample size required. A method for determining the AQL, that is, the defect level accepted 95% of the time, for a particular (n,c) pair is described later in this section. For example, to provide an LTPD of 2% with an acceptance number of 2, one would need at least 266 units to test. The AQL for this sampling plan of n = 266 and c = 2 can be shown to be 0.31%. However, tables have limitations, because the LTPD values are restricted to only those listed. Using the qualification plan stressing 300 devices for 1000 hours as an example, we see the sample size of 300 is not exactly provided for when c = 3 in Table 11.7. Although possibly adequate, the ranges of sample sizes and acceptance numbers do not cover all possibilities. Hence, it is beneficial to see how we can solve LTPD problems in a more general way by using spreadsheet functions. Let us continue with this example. The first step is to determine the LTPD value for the sampling plan n = 300, c = 3. We seek an LTPD value p such that the probability of c = 3 or less rejects is less than or equal to .10. The cumulative binomial formula is

$$P(X \le 3) = \sum_{x=0}^{3} \binom{300}{x} p^x (1-p)^{300-x} \le 0.10$$

TABLE 11.10
LTPD Sampling Plans
Minimum size of sample to be tested to assure, with 90% confidence, that a lot having percent defective equal to the specified LTPD will not be accepted (single sample). Minimum sample sizes; for device-hours required for life test, multiply by 1000.

Acceptance No. (c)   Max. Percent Defective (LTPD) or λ
(r = c + 1)        50    30    20    15    10     7     5     3     2   1.5     1   0.7   0.5   0.3    0.2   0.15    0.1
 0                  5     8    11    15    22    32    45    76   116   153   231   328   461   767   1152   1534   2303
 1                  8    13    18    25    38    55    77   129   195   258   390   555   778  1296   1946   2592   3891
 2                 11    18    25    34    52    75   105   176   266   354   533   759  1065  1773   2662   3547   5323
 3                 13    22    32    43    65    94   132   221   333   444   668   953  1337  2226   3341   4452   6681
 4                 16    27    38    52    78   113   158   265   398   531   798  1140  1599  2663   3997   5327   7994
 5                 19    31    45    60    91   131   184   308   462   617   927  1323  1855  3090   4638   6181   9275
 6                 21    35    51    68   104   149   209   349   528   700  1054  1503  2107  3509   5267   7019  10533
 7                 24    39    57    77   116   166   234   390   589   783  1178  1680  2355  3922   5886   7845  11771
 8                 26    43    63    85   128   184   258   431   648   864  1300  1854  2599  4329   6498   8660  12995
 9                 28    47    69    93   140   201   282   471   709   945  1421  2027  2842  4733   7103   9468  14206
10                 31    51    75   100   152   218   306   511   770  1025  1541  2199  3082  5133   7704  10268  15407

Note: Sample sizes are based upon the Poisson exponential binomial limit. Source: MIL-M-38510J, General Specifications for Military Specification Microcircuits, Table B-1, Government Printing Office, Washington, DC, 1991.


We use the relationship between the binomial distribution and the beta function to write

$$1 - I_p(4,\,300-4+1) = \sum_{k=0}^{4-1} \binom{300}{k} p^k (1-p)^{300-k} \le 0.10$$

or

$$I_p(4,\,297) \ge 0.9$$

Next, we apply the spreadsheet function BETAINV(0.9,4,297) to obtain the value p = 2.21% for the LTPD. Such a procedure was used to create the plot of LTPD values versus sample size for various acceptance numbers, as shown in Figure 11.13. With p = LTPD = 0.0221 determined, we now need to solve for n for different c values to satisfy the binomial formula.

$$P(X \le c) = \sum_{x=0}^{c} \binom{n}{x} (0.0221)^x (1-0.0221)^{n-x} \le 0.10$$

Obviously for c = 0, it is a simple matter to solve the resulting equation for n:

$$P(X = 0) = (1 - 0.0221)^n \le 0.1$$

or

$$n \ge \frac{\ln(0.1)}{\ln(1 - 0.0221)}$$

and obtain the value n = 103. For c > 0, it is still relatively simple to find the corresponding n using the spreadsheet routine Goal Seek. We first set up two columns each with four rows as shown in Table 11.11. The first column contains the labels c, LTPD, n, and beta risk.

FIGURE 11.13 LTPD versus sample size for different acceptance values (c = 0 through c = 10; sample size on a log scale).


TABLE 11.11
LTPD Evaluation

  c            2
  LTPD         0.0221
  n (trial)    100
  Beta risk    0.6185

Into the adjacent column, we enter the c-value of interest (e.g., 2), the LTPD value (here, p = 0.0221), a trial number for n (e.g., 100), and the function for the cumulative binomial distribution = BINOMDIST(c,n,LTPD,1). The beta risk evaluates to 0.6185. We could manually change the sample size n until the beta risk is less than 0.1. However, Goal Seek will make our job easier. Under Tools in the main spreadsheet menu, we select Goal Seek and tell the program to set the cell (formula cell in OpenOffice) containing the beta risk to 0.1 (target cell in OpenOffice) by changing the cell (variable cell in OpenOffice) for sample size n. The program returns the value for n of 239.62, and we verify that 239 is sufficient to keep the beta risk under 0.1. We showed that the LTPD or RQL (at 90% probability of rejection) associated with any sampling plan (n,c) can be found by using the inverse beta spreadsheet function BETAINV(0.90,c + 1,n − c). Similarly, to determine the AQL (at 95% probability of acceptance) for any sampling plan (n,c), we use the function BETAINV(0.05,c + 1,n − c). In this manner, we find that the AQL for the original sampling plan (3,300) is 0.45%, but the AQL for the (0,103) plan is only 0.053%. Thus, the product sampled must have a lower percent defective in order to consistently pass the plans having smaller acceptance numbers. Such is the price one pays to reduce the sample size while holding the same LTPD. A nontechnical way of showing this AQL shift is to observe that the allowed fallout to pass the test goes down from three failures out of 300 (or 1%), to two failures in 223 (or 0.9%), to one failure in 165 (or 0.6%), to zero failures in 100 (or 0%) as the sample size decreases. Of course, once a plan is chosen, we can immediately generate the OC curve to check the risks at other p values. EXERCISE 11.32 For the same AQL, RQL, and alpha and beta risk levels as in Exercise 11.30, use Table 11.10 to determine a suitable sampling plan, that is, n and c. EXERCISE 11.33 Repeat previous exercise using the same AQL, RQL, and alpha and beta risk levels as in Exercise 11.31. EXERCISE 11.34 Using the Goal Seek function, find the necessary sample size for an LTPD = 2.21% and a beta risk of 0.1 for the acceptance numbers c = 0 and 1. EXERCISE 11.35 Determine the AQL associated with the LTPD = 2.21% for the acceptance numbers c = 0 and 1 and the sample sizes obtained from Exercise 11.34. EXERCISE 11.36 The qualification requirements allow a maximum of two failures on 500 devices stressed. What is the AQL and RQL at α = 0.05 and β = 0.10 risks?
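For readers not using a spreadsheet, the BETAINV and Goal Seek steps above can be sketched in Python (an assumption for illustration; scipy.stats.beta.ppf plays the role of BETAINV, and a simple integer search replaces Goal Seek).

```python
# Sketch: LTPD and AQL of a plan (n, c) from the inverse beta distribution,
# plus an integer search for the smallest n holding a given LTPD at a
# reduced acceptance number.
from scipy.stats import beta, binom

def ltpd(n, c):          # defect level rejected with 90% probability
    return beta.ppf(0.90, c + 1, n - c)

def aql(n, c):           # defect level accepted with 95% probability
    return beta.ppf(0.05, c + 1, n - c)

print(ltpd(300, 3), aql(300, 3))     # ~0.0221 and ~0.0045 for the (300, 3) plan

def n_for_ltpd(c, p, beta_risk=0.10):
    n = c + 1
    while binom.cdf(c, n, p) > beta_risk:
        n += 1
    return n

p_star = ltpd(300, 3)                # unrounded LTPD for the original plan
print(n_for_ltpd(0, p_star))         # ~103, as found above for c = 0
```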


EXERCISE 11.37 The qualification requirements allow a maximum of two failures on 500 devices stressed. The engineer wishes to reduce the sample size. Determine the sample size to hold the same RQL at the same β risk for acceptance number c = 1. Repeat for c = 0. What are the AQL’s for each case at α = 0.05? EXERCISE 11.38 The lot acceptance criteria allows a maximum of three failures on 300 devices inspected. The manufacturer wants to reduce the sample size for inspection. Determine the sample size to hold the same AQL at the same 5% producer’s risk for the acceptance number c = 2. Repeat for c = 1 and c = 0. What are the RQL’s for each case at consumer’s risk = 0.10?

11.13  Minimum Sample Size Plans Suppose we wish to protect against a Type II error using the smallest sample size possible, and Type I error is not a primary concern. This situation may occur, for example, when parts are limited in availability or are highly expensive or time consuming to test. Minimum sampling plans are based on an acceptance number of zero, that is, c = 0, because any number greater than zero would require a larger sample size be used for the same rejectable percent defective level. If we assume the sample size is small relative to the lot size or the sample is drawn from an ongoing process, then the binomial distribution applies. Hahn (1979) treats the situation where the lot size does matter and develops curves based on the hypergeometric distribution; at the lower percent defective values, however, considerable “eyeball” interpolation is involved. The determination of the minimum sample size is quite simple. Since the acceptance number is zero, we want the probability of zero failures to be equal to the consumer risk level, that is, the β or Type II error. Thus, we have

$$(1 - p)^n = \beta$$

Solving for n gives

$$n = \frac{\ln \beta}{\ln(1 - p)}$$

as the minimum sample size necessary to assure a maximum Type II error risk of β if the true population fraction defective is at least as high as p. For example, to protect against a fraction defective higher than p = .015 with 90% confidence, the minimum sample size is about 150 units.

11.14  Nearly Minimum Sampling Plans Some engineers (not to mention managers) might feel uncomfortable about making a decision based on no failures. The common feeling is that some consideration has to be given to the producer’s risk, even at the cost of additional units. The nearly minimum type of


TABLE 11.12
Spreadsheet for Nearly Minimum Sampling Plans

  c            1
  LTPD         0.015
  Beta risk    0.05
  n            315.0707
  Cum Binom    0.049615

plan allows at most one failure to occur before the lot or process is rejected. Of course, the sample sizes are higher than the minimum sampling case (c = 0) for the same confidence level and fraction defective. The derivation of the nearly minimum sampling plan is straightforward. We want the probability of zero or one failures to be set at the consumer risk level. Then, based on the binomial distribution, we require

$$(1 - p)^n + n(1 - p)^{n-1} p = \beta$$

This equation cannot be solved explicitly for n, but it is a simple procedure to calculate corresponding sample sizes n for various p values using the Goal Seek spreadsheet feature. Set up the spreadsheet as shown in Table 11.12 with the cell for the cumulative binomial distribution containing the formula BINOMDIST(1,n,LTPD,1). Enter a trial value for n. Then tell the Goal Seek routine to make the formula cell equal to the value for the beta risk by changing the n cell. As the table shows, one would need 315 units to assure with 95% confidence that the population fraction defective is no higher than .015, assuming not more than one failure occurs in the sample. EXERCISE 11.39 Determine the minimum sample size to provide protection at 200 PPM for a β = 0.1 risk level. EXERCISE 11.40 Determine the nearly minimum sample size to provide protection at 200 PPM for a β = 0.05 risk level.
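A minimal sketch of the minimum and nearly minimum sample size calculations, assuming Python in place of the spreadsheet; it reproduces the roughly 150 units cited in Section 11.13 and the 315 units of Table 11.12.

```python
# Sketch: minimum (c = 0) and nearly minimum (c = 1) sample sizes for a
# given rejectable fraction defective p and consumer risk beta.
from math import ceil, log
from scipy.stats import binom

def n_minimum(p, beta_risk):
    # closed form: (1 - p)^n = beta  ->  n = ln(beta) / ln(1 - p)
    return ceil(log(beta_risk) / log(1 - p))

def n_nearly_minimum(p, beta_risk):
    # smallest n with P(X <= 1) <= beta; a simple search instead of Goal Seek
    n = 2
    while binom.cdf(1, n, p) > beta_risk:
        n += 1
    return n

print(n_minimum(0.015, 0.10))          # about 150 (153 exactly)
print(n_nearly_minimum(0.015, 0.05))   # 315, matching Table 11.12
```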

Besides single-sampling plans, there are double-sampling plans in which, under specified conditions, there may be the need to take a second sample before a lot decision is made. There are also multiple and sequential sampling plans. For further information, the interested reader should consult Montgomery (2009).

11.15  Relating an OC Curve to Lot Failure Rates Lot acceptance sampling plans can be designed to evaluate and assure reliability. A sample from a lot is tested for t hours at high stress and the proportion of reliability failures generated is the sample defect level. Intuitively, if this defect level is controlled and low enough,


the product will have an acceptable failure rate. Often, however, it is difficult to quantify what defect level is acceptable and how any defect level relates to field failure rates. To convert the proportion defective scale on an OC chart to a field failure rate scale, we need a model that relates fallout at high stress for t hours to a normal use average failure rate (AFR). In this section, we will assume that model is known to the extent that we have an acceleration factor A that converts test hours to field hours, and we also know the life distribution. From this information, we will derive the equations that can be used to transform average failure rate objectives to an OC chart scale and vice versa. For the exponential distribution, the calculation is simple. We convert a proportion p failures at t hours of stress to p failures at the equivalent At hours of field use and solve the following equation for λ:

$$1 - e^{-\lambda A t} = p$$

obtaining λ = [− ln(1 − p)]/At. This value of λ is the AFR (see Chapter 2) that corresponds to p. For example, if the stress runs 48 hours and the acceleration factor for a particular mechanism is 100, then a fraction fallout level of 0.01 corresponds to an AFR of 0.21%/K hour. Conversely, an AFR objective of 500 FITS translates into a p value of .0024 for the same At. The conversion for a Weibull or a lognormal requires an additional assumption, because these distributions have two parameters. We have to assume the shape parameter (m or σ) is known for the product and does not vary significantly from lot to lot. Lot quality causes the characteristic life c or the median T50 parameter to vary, accounting for changes in the failure rate. This assumption is strong—but experience has shown it is often reasonable. To obtain the equations to go from failure proportion p to a Weibull AFR (assuming m is known), we start with the following equation:

$$p = 1 - e^{-\left(\frac{At}{c}\right)^{m}}$$

Solving for the characteristic life c, we get

$$c = \frac{At}{[-\ln(1-p)]^{1/m}}$$

The Weibull AFR (see Table 4.3) over the period 0 to U hours is

$$\mathrm{AFR} = \frac{(U/c)^m}{U} = \frac{U^{m-1}}{c^m}$$

Substituting for c, we obtain the desired equation expressing the AFR in terms of the proportion failing p:

$$\mathrm{AFR} = \frac{U^{m-1}\,[-\ln(1-p)]}{(At)^m}$$

where U is the period of use lifetime of interest for the average failure rate calculation. Similarly, p can be written in terms of the AFR as

$$p = 1 - e^{-\frac{(At)^m \cdot \mathrm{AFR}}{U^{m-1}}}$$

To get the equations for a log-normal failure mode (assuming σ is known), we start with

$$p = \Phi\!\left(\frac{\ln(At/T_{50})}{\sigma}\right)$$

Solving for T50, we obtain

$$T_{50} = At\, e^{-\sigma \Phi^{-1}(p)}$$



As we saw in Chapter 2, the AFR over the period 0 to U hour is

$$\mathrm{AFR} = \frac{-\ln[R(U)]}{U} = \frac{-\ln[1-F(U)]}{U}$$

From Chapter 5, the lognormal CDF is $F(U) = \Phi\!\left(\frac{\ln(U/T_{50})}{\sigma}\right)$, and so

$$\mathrm{AFR} = \frac{-\ln\left[1 - \Phi\big(\ln(U/T_{50})/\sigma\big)\right]}{U}$$

where $T_{50} = At\, e^{-\sigma \Phi^{-1}(p)}$. Similarly, p can be expressed in terms of the AFR as

$$p = \Phi\!\left(\Phi^{-1}\big(1 - e^{-U\cdot \mathrm{AFR}}\big) - \frac{\ln\big(U/(At)\big)}{\sigma}\right)$$

By using the formulas in this section, lot acceptance sampling plans can be defined in terms of an AQL field average failure rate and an LTPD field average failure rate. Conversely, we can determine a sampling plan based on translating AQL and LTPD field failure rates into defect levels p occurring under accelerated stress testing. EXAMPLE 11.12  SAMPLING PLAN FOR ACCELERATED STRESS, WEIBULL DISTRIBUTION The acceptable (AQL at 5% risk) field average failure rate for 4000 hours is 330 FITS. The rejectable (LTPD) field average failure rate is 1000 FITS. If the failure distribution is Weibull with shape parameter m = 0.8 and the acceleration factor A is 301 for the particular failure mode of concern, what are the sample size and acceptance numbers for a stress running for 168 hours?

Solution Using the above equation that expresses the proportion failing p in terms of the Weibull AFR, we see that AFR = 330 FITS translates to p = .01, and AFR = 1000 FITS, into p = .03. Thus, from either


Table 11.7 or using methods discussed in this chapter, the sampling plan for these p values at risks α = 0.05 and β = 0.10 is n = 390 and c = 7. EXERCISE 11.41 The AQL field average failure rate for 40,000 hours is 180 FITS. The LTPD field average failure rate is 520 FITS. If the distribution of failures is log-normal with sigma equal to 4.0, and the acceleration factor A = 500, determine the sampling plan in terms of n and c for a lot acceptance test lasting t = 48 hours. EXERCISE 11.42 The AQL field average failure rate for 24,000 hours is 50 FITS. The LTPD field average failure rate is 254 FITS. If the distribution of failures is Weibull with shape parameter equal to 0.75, and the acceleration factor A = 205, determine the sampling plan in terms of n and c for a lot acceptance test lasting t = 2000 hours.
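The conversions in this section are simple enough to script. The following sketch (assumed Python code, not the authors' implementation) reproduces the exponential example earlier in the section and the p values of Example 11.12; FIT here means failures per 10^9 device-hours.

```python
# Sketch of the AFR <-> fraction-defective conversions of this section.
from math import exp, log

def p_from_afr_weibull(afr_per_hour, A, t, m, U):
    # p = 1 - exp(-(A t)^m * AFR / U^(m-1))
    return 1 - exp(-((A * t) ** m) * afr_per_hour / (U ** (m - 1)))

def afr_weibull_from_p(p, A, t, m, U):
    return (U ** (m - 1)) * (-log(1 - p)) / ((A * t) ** m)

# Exponential case from the text: t = 48 h, A = 100, p = 0.01
print(-log(1 - 0.01) / (100 * 48) * 1000)   # ~0.0021 per 1000 hours, i.e., 0.21%/K hour

# Example 11.12: Weibull shape m = 0.8, A = 301, t = 168 h, U = 4000 h
print(p_from_afr_weibull(330e-9, 301, 168, 0.8, 4000))    # ~0.01 (AQL)
print(p_from_afr_weibull(1000e-9, 301, 168, 0.8, 4000))   # ~0.03 (LTPD)
```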

11.16  Statistical Process Control Charting for Reliability We have discussed procedures for lot acceptance. However, statistical process control techniques can also be used for monitoring the reliability of a process. Basically, the fraction (or percent) defective values from periodic samples are plotted on a graph. This chart, called a control chart, has a line, called an upper control limit (UCL), that indicates when the plotted fraction defective is significantly higher than the normal process average. The difference between an UCL and an engineering specification is critical: The control limit is based on the recent history of the process; the engineering specification may have no relationship to the actual process performance. The UCL is based on the averages of samples. An engineering specification is usually applied to individual units. Thus, we can have a situation in which the process is “in spec” but yet out of control (and vice versa)! The UCL is simple to calculate. If p is the historical process average, then the upper control limit is commonly based on the normal approximation to the binomial distribution. The formula is

$$\mathrm{UCL} = p + z_{1-\alpha}\sqrt{\frac{p(1-p)}{n}}$$

where n is the periodic sample size and z1−α is the standard normal variate value corresponding to a desired (1 − α) × 100% confidence level. Typically z1−α is set at 2 or 3 to roughly correspond to one-sided probabilities of 97.7% or 99.9%, respectively. The interpretation is that there are only two chances in one hundred (or one chance in a 1000 for the higher probability) that a sample of size n would have a percent defective above the 2 (or 3) “sigma” limit. Therefore, when the control limit is exceeded, we assume the process is out of control and corrective action takes place. Otherwise, the process is assumed operating normally and no interference is required. The term “sigma” refers to the fact that the equation for the UCL involves the term for the standard error of the binomial distribution along with the normal distribution approximation for the probabilities. If the periodic sample size is nearly constant (i.e., any one lot is within 30% of the average sample size), then it is not necessary to adjust the UCL for each sample. Instead, one


FIGURE 11.14 Three-sigma control chart for binomial proportions (sample proportion versus sample number, with mean and UCL lines).

constant line will serve as the control limit. Figure 11.14 illustrates a typical "p-type" control chart for attribute data. The normal approximation is commonly used for setting the limits. The exact probabilities associated with these limits can be found by application of the cumulative binomial distribution. Thus, for example, assuming n = 100 and process average p = .1, the 3 sigma UCL is 0.19. Using spreadsheet functions, we see the corresponding normal probability is 1 − NORMSDIST(3) = .00135. The exact binomial probability is 1 − BINOMDIST(19,100,.1,1) = .00198. There are also control charts for variables data involving both the sample means and the sample ranges and for Poisson distributed data. Many excellent books on the subject of statistical quality control exist: see Burr (1976, 1979), Deming (1982), Duncan (1986), Grant and Leavenworth (1988), Ishikawa (1982), Juran (1988), Ott, Schilling, and Neubauer (2005), or the classic Western Electric Handbook (1958). The interested reader involved in monitoring and improving process reliability should consult these references. EXAMPLE 11.13  CUMULATIVE COUNT CONTROL CHARTS FOR LOW PPM In reliability and quality work, we often encounter situations involving low defect proportions. The typical p-type control chart becomes ineffective in a low PPM environment, because most samples of reasonable size yield zero defects, and a chart that shows only zeros has rather limited utility. Also, the normal approximation to the binomial distribution is not sufficiently accurate at low PPM values. For example, consider a process averaging 100 PPM. A sample of 200 pieces is routinely drawn from the process for monitoring. The 3-sigma UCL based on the equation above is $0.0001 + 3\sqrt{0.0001(0.9999)/200} = 0.0022$, or 22 PPM. However, one reject out of 200 pieces is 0.0050 fraction defective or 50 PPM. Thus, any reject will cause the process to be considered "out-of-control." Yet, an exact calculation for the binomial shows that the probability of at least one failure for n = 200, p = .0001 is about 2%, and not roughly one chance in 1000 as implied by 3-sigma limits. A superior approach concentrates on the number of good items produced instead of focusing on the number of defective units observed. Called the cumulative count control chart, the method is


described by Calvin (1983) and Fasser and Brettner (1992). Basically, the accumulated number of good units from each inspection is plotted on a vertical logarithmic scale versus time. The upper and lower control limits are exact binomial limits based on the process average and are set to provide a confidence interval for zero rejects such that an accumulated point of good units will fall inside the limits when a process is under control. For example, suppose a process is averaging 100 PPM. Then, for a sample of size 513 drawn from this process, binomial distribution calculations show there is a 5% chance of obtaining at least one defect and 95% chance of zero defects; for a sample of size 29,956, there is a 95% chance of getting at least one defect and a 5% chance of zero defects. We say the 90% control limit band is from 513 to 29,956. So if the number of good units accumulated is less than 513 before the first defect is observed, the data does not support a 100 PPM process capability. Similarly, if the UCL of 29,956 is exceeded by the accumulated number of good units, then we state that the process is significantly better than 100 PPM. Thus, out-of-control sequences are immediately apparent. If the accumulated number of good units is between 513 and 29,956 when the first reject occurs, we accept the results as consistent with a process at 100 PPM. We then reset the accumulated count to zero and begin accumulating good units again. An example graph for a 100 PPM process average is shown as Figure 11.15. Table 11.13 shows minimum sample sizes for various PPM's versus the associated probabilities of zero rejects. The centerline of the control chart is set at the median (50%) probability level. Alternatively, if the process average defect level is p, the average number of good units between defects is 1/p. Then, the centerline is at 0.6931/p. For example, if p = 100 PPM, the expected number of good units between defects is 10,000 and the centerline is placed at 6,931. The lower control limit is set at $n_{LCL} = \frac{\ln(1-\alpha)}{\ln(1-p)}$ and the upper control limit is set at $n_{UCL} = \frac{\ln(\alpha)}{\ln(1-p)}$, where α is the probability of a count less than the LCL or above the UCL. In our example, we used α = 0.05 for a 90% control limit band, p = 100 PPM, and found $n_{LCL}$ = 513 and $n_{UCL}$ = 29,956.
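A short sketch of both sets of limits, assuming Python with scipy.stats in place of the spreadsheet functions; it reproduces the 22 PPM UCL, the roughly 2% chance of at least one reject in a sample of 200, and the cumulative count control limits of 513 and 29,956.

```python
# Sketch: 3-sigma p-chart UCL with an exact binomial check, and the
# cumulative count control chart (CCC) limits used in Example 11.13.
from math import log, sqrt
from scipy.stats import binom

p, n = 0.0001, 200                      # 100 PPM process, periodic sample of 200
ucl = p + 3 * sqrt(p * (1 - p) / n)
print(ucl)                              # ~0.0022, i.e., 22 PPM
print(1 - binom.cdf(0, n, p))           # P(at least one reject) ~ 0.02, not ~0.001

alpha = 0.05                            # 90% control limit band
n_lcl = log(1 - alpha) / log(1 - p)     # ~513
n_ucl = log(alpha) / log(1 - p)         # ~29,956
centerline = log(2) / p                 # ~6,931 expected good units at the median
print(round(n_lcl), round(n_ucl), round(centerline))
```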

FIGURE 11.15 Cumulative count control chart for a process average of 100 PPM, with 90% control limits (LCL = 513, CL = 6,931, UCL = 29,956); cumulative count nondefective plotted on a log scale against the rejects per sample size at each inspection period.

EXERCISE 11.43 Determine the control limits for a cumulative count control chart for a process at 20 PPM. What is the expected number of good units between defects? Determine the centerline. Assume α = 0.01.


TABLE 11.13
Minimum Sample Sizes for Zero Rejects at Various Probabilities

        Probability of Zero Occurrences
PPM     0.999    0.995     0.99    0.975     0.95      0.9      0.8      0.7      0.6      0.5      0.4        0.3        0.2        0.1       0.05      0.025       0.01      0.005      0.001
1        1000     5013   10,050   25,318   51,293  105,360  223,143  356,675  510,825  693,147  916,290  1,203,972  1,609,437  2,302,584  2,995,731  3,688,878  4,605,168  5,298,315  6,907,752
5         200     1003     2010     5064   10,259   21,072   44,629   71,335  102,165  138,629  183,258    240,794    321,887    460,516    599,145    737,774    921,032  1,059,661  1,381,548
10        100      501     1005     2532     5129   10,536   22,314   35,667   51,082   69,314   91,629    120,397    160,943    230,257    299,572    368,886    460,515    529,829    690,772
20         50      251      503     1266     2565     5268   11,157   17,834   25,541   34,657   45,814     60,198     80,471    115,128    149,785    184,442    230,256    264,913    345,384
50         20      100      201      506     1026     2107     4463     7133   10,216   13,863   18,325     24,079     32,188     46,051     59,913     73,776     92,101    105,964    138,152
75         13       67      134      338      684     1405     2975     4755     6811     9242   12,217     16,052     21,458     30,700     39,942     49,183     61,400     70,642     92,100
100        10       50      100      253      513     1054     2231     3567     5108     6931     9162     12,039     16,094     23,025     29,956     36,887     46,049     52,981     69,074
200         5       25       50      127      256      527     1116     1783     2554     3465     4581      6019       8046     11,512     14,977     18,443     23,024     26,489     34,535
500         2       10       20       51      103      211      446      713     1021     1386     1832      2407       3218       4604       5990       7376       9208     10,594     13,812
1000        1        5       10       25       51      105      223      356      511      693      916      1203       1609       2301       2994       3687       4603       5296       6904

11.17  Summary In this chapter, we considered the relationship of quality control concepts to reliability problems. Starting with simple ideas on permutations and combinations, we developed the binomial distribution. We talked about how binomial estimates could be used to nonparametrically provide probability information in the absence of knowledge about the underlying distribution of fail times. We considered the simulation of binomial random variables. We also discussed other discrete distributions: the geometric, the negative binomial, the hypergeometric, and the Poisson. We showed the correspondence between the discrete Poisson and the continuous exponential distributions. We demonstrated the application of the hypergeometric distribution in Fisher’s exact test. We discussed various types of sampling, quality concepts such as AQL, LTPD, AOQ, and AOQL, associated risks, properties of OC curves, and the selection of sample sizes. We illustrated many procedures, graphical and otherwise, to simplify the generation, implementation, and comparison of various sampling plans. The application of minimum and near-minimum sampling plans was covered. The relation of failure rates to sampling considerations was highlighted. Finally, we showed how statistical process control could be applied to the monitoring of reliability situations. In particular, we addressed the issue of monitoring low PPM situations using cumulative count control charts.

Problems





1. Calculate the number of permutations of four objects: A, B, C, D. List all permutations.
2. Calculate the number of permutations of six objects: A, B, C, D, E, F taken two at a time. List all permutations.
3. Calculate the number of combinations of five objects A, B, C, D, E taken three at a time. List all combinations.
4. One hundred devices are placed on stress for 10,000 hours. The probability of a device failing by 1000 hours is 0.02 or 2%. Assuming the devices on stress are from the same population:
   a. What is the probability that all devices survive 1000 hours?
   b. What is the probability that exactly two devices fail by 1000 hours?
   c. What is the probability of at least two devices failing?
5. We run a test with 50 devices. At the end of the test, there are 15 failures. What is a 90% confidence interval for the population fraction defective? What is an 80% confidence interval? Explain the difference.
6. In failure analysis, we have 20 devices to analyze. We are looking for a specific failure mechanism. If there are exactly five devices in the 20 with the specific mechanism, what is the probability that we get none with that mechanism in a random sample analysis of eight devices?


7. There are 5000 devices placed on stress. If the failure probability for a device at the end of test is .0002 (i.e., .02%), what is the probability that there are no failures? At least one failure?
8. 50 devices are placed on stress for 168 hours. The constant failure rate is 30% per K hour during the interval on stress. Using the Poisson distribution:
   a. What is the expected number of failures?
   b. What is the probability that all devices survive 168 hours?
   c. What is the probability of at least one failure? Compare with the results in Exercise 11.4.
9. 150 devices are stressed for 1000 hours. There are no failures. Estimate an upper 95% confidence limit on the population percent defective.
10. 100 devices are stressed for 500 hours. Two failures occur. Estimate a 95% confidence interval on the population percent defective.
11. We wish to generate a sampling plan with a 95% probability of acceptance of product that is 1% defective and a 90% probability of rejecting 10% defective product. Find the sample size n and the acceptance number c.
12. What is the minimum size of a sample to be tested to assure with 90% confidence that a lot having 7000 PPM defective will not be accepted? What is the sample size if the acceptance number c = 3?
13. Sketch an OC curve for n = 1000, c = 0.
14. Sketch an OC curve for n = 500, c = 1.
15. We are given a sampling plan with n = 500 and c = 2. Determine the AQL (95%) and the LTPD (10%). Determine the sample size necessary for the same LTPD, but with the acceptance number c = 1. Check your results using Table 11.10.
16. What is the minimum sample size to protect against accepting a fraction defective higher than 500 PPM with 90% confidence?
17. What is the nearly minimum (c = 1) sample size to protect against accepting a fraction defective higher than 250 PPM with 95% confidence?
18. If the historical process average is 500 PPM, calculate the upper control limit for a periodic sample size of 1000. In an SPC mode, what happens if you get two rejects in a sample of size 1000? What is the probability of getting two or more rejects in a sample of size 1000? Explain the difference in the two approaches.
19. Construct a cumulative count control chart for a process average of 250 PPM. Draw the control limits and the centerline. Assume α = 0.05.


12 Repairable Systems Part I: Nonparametric Analysis and Renewal Processes In all previous chapters, we focused on the reliability of nonrepairable components such as lightbulbs or integrated circuits. The working assumption was that times to failure were a truly random sample of independent and identically distributed (i.i.d.) observations from a single population. Consequently, individual failure times could be combined for analysis, neglecting any order of occurrence of the original data. For example, suppose limited oven space requires us to divide a sample of units for stress into two groups and run each group in separate weeks. When both groups have completed the stress (under identical conditions), we reasonably assume that we can combine the data for analysis irrespective of the first group’s failure times occurring before the failure times of the second group. However, there are many common situations in which the occurrence order of failure times has significance. Consider an action that restores a failed system or process to operation, for example, replacing a circuit board fixes a computer, changing a clogged gas line filter repairs an automobile, resuming electrical power to return a factory to production following a utility failure, or calling a server reconnects an interrupted video-teleconfere­ncing session. In all these instances, the failures occur sequentially in time, and assumptions of a single population distribution and independence for the times between failures may be invalid. If the i.i.d. properties hold for the times between failures, the repair rate is stable, and the renewal process briefly mentioned in Chapter 2 is an appropriate model. This chapter focuses mainly on the characteristics and features of renewal processes. Conversely, if there is evidence of a trend such as varying (e.g., improving or worsening) repair rates, the renewal model no longer applies. Then, assessing and modeling the system behavior requires consideration of the patterns of the sequential repair times. Chapter 13 covers the subject of nonrenewal processes in detail. Chapters 12 and 13 present key concepts on analyzing data from systems under­going repair. The topics include useful methods, both nonparametric and parametric, for describing and modeling such data. Verification of assumptions for analysis is stressed. To enhance comprehension, simulation techniques are provided. We cover many important and practical issues involving the reliability of repairable systems in this chapter.

12.1  Repairable versus Nonrepairable Systems The reliability literature extensively covers the treatment of nonrepairable units. However, these analysis procedures are typically not suitable for analyzing data from repairable systems except for special situations such as renewal processes. Also, the confusing use of similar terminology among authors describing very different repairable and nonrepairable 417


situations has muddled otherwise easy-to-understand concepts. Ascher and Feingold (1984, page 1) point out that, with few exceptions, “the entire area of repairable systems has been seriously neglected in the reliability literature” even though “most real world systems are intended to be repaired rather than replaced after failure.” Consider an operating system—at some point in time—that fails to perform an expected function. (The concept of repairability also can be applied to a process, such as assembly operations in a factory. For example, if production is halted by a power outage, the repair returns the manufacturing process to full operations. However, for simplicity, we shall use the term “system” for either possibility.) The system is repairable if—short of replacing the entire system—it can be restored to satisfactory operation by any action, including replacement of components, changes to adjustable settings, swapping of parts, or even a sharp blow with a hammer. Obviously, a television set is a repairable system, which, upon failure, may be fixed possibly by replacing a failed component such as a resistor, capacitor, or transistor, or by adjustments to specific settings. Automobiles and network servers are also familiar types of repairable systems. Understanding repairable system behavior is important for many reasons. Obviously, customer satisfaction in the availability of a system to perform its functions is a key consideration. Maintenance schedules and spare parts provisions are affected by expected repair rates. Detection of trends is important to assure suitable performance, especially when health and safety issues are involved. In development stages, information obtained on the first prototype systems—often only a single costly system is available for study— may be used for reliability improvement of later ones or for designing better future systems. (For an example of such an application, see Usher 1993.) Conversely, in production mode, where many copies of a system are produced, the objective may be to estimate the repair rate of the population of systems in field usage. Issues of concern may be specifying burn-in and measuring its effectiveness, providing for spare parts, forecasting repair and warranty costs, or establishing a preventive maintenance schedule. (For an illustration of a comparison of two levels of in-house testing to reduce early life failures prior to shipment of systems to customer sites, see Zaino and Berke 1992.) For a single-component system, if repair occurs by an action that apparently restores the system to “like new” condition (such as a simple replacement of the failed component with a new component from the same population), then a renewal process seems likely. In such situations, it may be reasonable to assume that the times between successive repairs are independent and from the same component life distribution. However, replacement of a failed component, by itself, does not necessarily assure a renewal process. For example, Usher (1993) describes a system repaired by the replacement of a component with an identical unit from the same population. Yet, because the cooling unit of the system was degrading, the temperature inside the system was gradually rising, and the times between consecutive fails became shorter and shorter. The result was increasing repair rates, thereby negating any possibility of a renewal process. 
Under a renewal model, a single distribution characterizes the independent times between failures, and the repair frequency appears fairly constant. If the frequency of system repairs is increasing or decreasing with age—indicating deterioration or ­improvement—then we have nonrenewal behavior. For a nonstable repair rate, maintenance costs and provisions for spare parts are expected to vary in time. Consequently, it is important to distinguish renewal processes from general repair processes. Renewal processes permit simplification of analysis. Nonrenewal processes are more complicated because of nonstationary characteristics. This chapter will focus on the renewal situation, and nonrenewal processes will be covered in Chapter 13.


The discussion of renewal processes begins with some simple graphical procedures applied to single repairable systems. Next is the analysis of many systems with different operating times (multicensored data). The graphical results motivate analytical techniques. Time, that is, the system age defined as the total running hours starting at installation turnon, will be the primary metric for repairable systems, but other measures such as miles or cycles between repairs may be used. The methods shown easily extend to these alternative scales.

12.2  Graphical Analysis of a Renewal Process Many statistical procedures exist for analyzing data from repairable systems, but analysts will gain valuable insight into system performance by simple plots of the data. Such plots generally indicate the analytical methods that are most appropriate. Graphical techniques may reveal trends and suggest models. It is strongly recommended that the first step in the analysis of any repair data be the plotting of the observations. We will present several chart types that are particularly useful in studying repairable system performance. Consider a single system for which the times involved in making repairs are assumed negligible and can be ignored. This simplification is reasonable since repair times are typically very small compared with a system’s operating hours. In Section 12.10, when we discuss the topic of availability, we shall incorporate the effects of repair times into the analysis. We observe 10 repair events at the following system ages in hours: 106, 132, 289, 309, 352, 407, 523, 544, 611, and 660. A line sketch (called a “dot plot”) shows the pattern of repairs (Figure 12.1). How do we treat such data to answer various questions about repair behavior? For example, “Is there any indication of the system getting better or worse or staying fairly stable?” An important measure of reliability for a repairable system is the cumulative number of failures N(t) that occur on the system by time t. For any t, the function N(t) is a random variable. A stochastic process is a collection of random variables that describes the evolution in time of a process. The total number of repairs N(t) occurring by time t for a system is an example of a stochastic counting process. A common data graph is the cumulative plot: the cumulative number of repairs is plotted against the system age t. A sample cumulative plot of a system is viewed as single observation from a population of possible curves. The population average cumulative number of repairs at age t is called the “mean cumulative function,” or MCF, and is denoted by M(t). We see that the MCF is the expected number of repairs per system by time t. Thus, M(t) = E{Ni(t)} for system i. Figure 12.2 is a cumulative plot of the above data. The repair history plots as a staircase function, with a step rise at each repair, using either connected or unconnected points. The stepwise nature of the plot is made evident here, but the stairs may be omitted for simplicity in subsequent plots.

FIGURE 12.1 Dot plot of repair pattern (system age in hours).


FIGURE 12.2 Cumulative plot (cumulative number of repairs versus system age in hours).

A linear appearance indicates the repair rate is basically constant. Alternatively, curvature in this plot can reveal whether the system reliability is improving or deteriorating with age. For a single, randomly chosen system, the actual number of repairs by time t provides an unbiased estimate of the population mean number of failures per system M(t), which we denote by $\hat{M}(t)$. The plot is a sample observation from a potential population of such curves. Note, however, that there are many situations in industry in which only one system is available (e.g., development activity using a prototype). Thus, the entire "population" consists only of that single system, but a conceptual population may be future versions of the system. Figure 12.2 shows that the cumulative number of repairs versus time appears to be reasonably linear. That is, the total number of repairs is proportional to age t, and the rate of repairs (i.e., the derivative m(t) = dM(t)/dt) is fairly constant in time. The repair process is called stationary because there is no apparent trend associated with system improvement or deterioration. This stationary characteristic suggests the possibility of a renewal process, that is, i.i.d. times between failures (see Cox 1962, for further discussion). For a renewal process, the graph will necessarily appear linear over time. Thus, a reasonable straight line fit to the cumulative plot can serve as a preliminary visual indication of a renewal process. However, an observation interval that is too short may not be sufficient to reveal any pattern. We shall discuss general analytical procedures starting with Section 12.6, but for now, let us assume a renewal process and analyze the data accordingly. Under a renewal process, the times between failures are i.i.d. Let the successive ages at failure be denoted by $t_j$, j = 1, 2, …, k. Let $X_1$ be the time to the first failure, that is, $X_1 = t_1$; let $X_2$ be the time between the first and second failure, that is, $X_2 = t_2 − t_1$, and so on. The values of $X_i$, called interarrival times, are then 106, 26, 157, 20, 43, 55, 116, 21, 67, and 49. Similar to statistical process control charts, a second useful and revealing graph type for repairable systems, especially for detecting trends, is a plot of the interarrival times against the system age at repair events. Figure 12.3 is such a graph, illustrating that there is no discernible trend in the data, as expected under a renewal process. An alternative approach would be to create a recurrence rate (RR) graph based on a plot of a scaled (e.g., annualized)


FIGURE 12.3 Interarrival times X_i (hours) versus system age (hours).

FIGURE 12.4 Recurrence rate (1/X_i, repairs/hour) versus system age (hours).

reciprocal of the interarrival times versus the system age at repair. Figure 12.4 shows an RR plot for these data. Again, no trend is observable. Because of the assumed renewal process, the X i can now be treated as a sample of 10 i.i.d. observations from the same population, just as if they were derived from 10 nonrepairable units starting stress together at time zero. For example, using the probability plotting approach (Chapter 6), we can sort the data from smallest to largest, assign median rank plotting positions, and plot the data on an appropriate probability paper, such as Weibull or lognormal. Alternatively, we can use maximum likelihood procedures to estimate the population parameters. In doing so, we assume the interarrival times are random variables from the same, single population. Based on physical considerations for this system, we fit a lognormal distribution to times between repairs. A lognormal probability plot of the interarrival times (Figure 12.5) supports the supposition.

FIGURE 12.5 Lognormal probability plot (interarrival times in hours versus cumulative percentage).

For estimating distribution parameters, renewal data from a single repairable system are complete (i.e., the data are noncensored up to the time of the last repair event). If we have some running time since the last repair, we should include the data as a single censored observation. For the lognormal distribution, we use a relatively simple method to estimate the parameters for complete data (see discussion in Chapter 5). To estimate T50, we use the mean μ̂ of the logarithms of the failure times, that is, T̂50 = e^μ̂, where the mean μ is estimated from the average μ̂ of the logarithms of the 10 times between failures. Thus,

$$\hat{T}_{50} = \exp\left\{\frac{1}{n}\sum_{i=1}^{n}\ln X_i\right\} = (X_1 X_2 \cdots X_n)^{1/n} = (106 \times 26 \times \cdots \times 49)^{1/10} = 52.65.$$

Note that T̂50 is called a geometric mean of the times (i.e., a geometric mean of n values is the nth root of the product of the n values). Similarly, the shape parameter σ for the lognormal distribution may be estimated from the standard deviation of the logarithms of the interarrival times. The shape parameter σ is estimated using the equation

$$\hat{\sigma} = \sqrt{\frac{\sum_{i=1}^{n}\left(\ln X_i - \hat{\mu}\right)^2}{n-1}} = \sqrt{\frac{(\ln 106 - \ln 52.65)^2 + (\ln 26 - \ln 52.65)^2 + \cdots + (\ln 49 - \ln 52.65)^2}{9}} = 0.72.$$
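For readers working outside a spreadsheet, the two estimates above can be reproduced with a few lines of Python. This is only a sketch of the calculation just described (Python is not one of the tools used in this book):

# A minimal sketch of the T50 and sigma estimates for the 10 interarrival
# times, using only the Python standard library.
import math

x = [106, 26, 157, 20, 43, 55, 116, 21, 67, 49]      # interarrival times (hours)
logs = [math.log(v) for v in x]

mu_hat = sum(logs) / len(logs)                        # mean of the log times
t50_hat = math.exp(mu_hat)                            # geometric mean = T50 estimate
sigma_hat = math.sqrt(sum((l - mu_hat) ** 2 for l in logs) / (len(logs) - 1))

print(round(t50_hat, 2), round(sigma_hat, 2))         # approximately 52.65 and 0.72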

FIGURE 12.6 CDF model fit versus observed (CDF versus interarrival times in hours).

It is always a good idea to plot the fitted model against the empirical cumulative distribution function (CDF) to assess the adequacy of the fit. Figure 12.6 is such a plot. The fit appears reasonable. Having a model allows us to compare actual results to anticipated results; to estimate the renewal rate with a smooth function; to predict not only the future behavior of the observed system but also the expected performance of the population of similar systems; and to potentially gain insight into the repair causes. In addition, for a single distribution, we can express reliability in terms of a mean time between failures, MTBF, for planning purposes or comparison to objectives. Finally, a possible model also allows us to apply statistical procedures specific to the model, including tests of parameters, assumptions, and so on.

With a model, we could also predict possible future repair scenarios for similar renewal processes by simulating interarrival times using randomly generated variables from a lognormal distribution having the estimated parameters. Each simulation of a set of interarrival times would represent one possible cumulative plot from the population of all possible such plots. The distribution of the plots at any system age t would suggest the degree of uncertainty in the expected number of repairs for any extrapolation (see the sketch following Exercise 12.1).

EXERCISE 12.1

The following 10 consecutive repair times are recorded for a system: 47, 90, 180, 208, 356, 377, 399, 461, 477, and 652. Make a cumulative plot. Is there evidence of a renewal process? Make a probability plot assuming a lognormal distribution. Estimate the parameters T50 and σ. Check the model distribution fit against the empirical CDF.
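As a rough illustration of the simulation idea described before Exercise 12.1, the following Python sketch (not from the book; it assumes the fitted values T50 = 52.65 and σ = 0.72) generates a few possible cumulative repair histories:

# A sketch: draw lognormal interarrival times and accumulate them into
# simulated system ages at each repair (one possible cumulative plot).
import math
import random

T50, SIGMA, N_REPAIRS = 52.65, 0.72, 10
random.seed(1)

def simulate_history():
    ages, age = [], 0.0
    for _ in range(N_REPAIRS):
        age += random.lognormvariate(math.log(T50), SIGMA)   # one interarrival time
        ages.append(age)
    return ages

for rep in range(3):          # three possible realizations of the cumulative plot
    print([round(a) for a in simulate_history()])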


12.3  Analysis of a Sample of Repairable Systems

As we have shown, the cumulative plot for a single system depicts the cumulative number of repair events versus the system age. However, we may be concerned with the overall behavior of many systems (e.g., in a datacenter). Alternatively, we may be interested in systems consisting of many identical independent subsystems (sometimes called "sockets" in the reliability literature) for which the system behavior is the combined repair events of all subsystems. In the case of many identical systems, we may be interested in the overall performance of systems where each system may have a different operating time under use conditions, resulting in multicensored data (see discussion of multicensored data in Chapter 7, Section 7.1).

To illustrate, consider a manufacturer of televisions. The company may collect sample data on the repair frequency of a new model type installed over the last calendar year. Some televisions will have been in the field for nearly the full year, and others may have only recently been purchased and placed in use. How do we estimate the mean repair rate for all televisions? What is the variation in the mean number of repairs for all televisions in the field at a given time? What is the expected time to the first repair? To the kth repair? Are the costs of repairs increasing or decreasing? What is the mean repair cost? Are spare parts adequate? Is burn-in or run-in (i.e., burn-in at operating conditions) necessary? How long should burn-in be done? How cost-effective is burn-in?

Both renewal and nonrenewal processes may involve multiple systems. Several analyses of multisystem or multisocket data have appeared in the reliability literature. One graphical tool involves comparing distributions of interarrival times or some other numerical measure for the nth repair in a system as measured across many systems. For example, histograms of the times to first repair among the systems can be compared with histograms of the times between first and second repairs, between second and third repairs, and so on. In this way, we are checking for trends and whether the various interarrival times are identically distributed. Obviously, this approach requires many systems for generating empirical distributions. This method was used by Davis (1952) to analyze the number of miles between successive major failures of bus engines. Davis generated histograms showing interarrival miles to the first failure, to the second failure, and so on. In addition to depicting the average miles between the (i − 1) and the ith repairs and displaying the variation in the data, the shape of the histograms provided useful information for investigating the repair frequency. In this study, Davis found average interrepair times to be decreasing, early interarrival times were nearly normally distributed, and later interarrival times were distributed roughly exponentially (see Ascher and Feingold 1984, page 86, for further discussion and display of the histograms from Davis). Although the histogram approach has value, one requires considerable data on repairs for valid comparisons. Also, the method does not provide a model for the mean number of repairs, repair rate, or other statistical measures as a function of, for example, time or miles.
Another direction proposed by O'Connor (2002) is to display an event plot, a graphical matrix structure in which each row represents the system repair history in a format similar to a dot plot with running time for the horizontal scale. In this manner, one can study the incidence of repairs against overhaul actions and across various systems, looking for possible patterns of common behavior. Again, there is no associated method provided for model development or testing, and multicensored data may limit the comparisons among systems. The method developed by Nelson (1988b, 2003) and extended by Nelson and Doganaksoy (1989) also is graphical, but it has several advantages compared with other approaches. Nelson provides simple and informative plots of censored data on numbers and costs of repairs on a sample of systems. In addition, he shows how to generate a nonparametric graphical estimate of the mean cumulative cost or number of repairs per system versus age, along with the estimation of the variance and confidence limits. We shall illustrate Nelson's procedures later in this section, after first discussing some elementary concepts.

Consider a population of systems subject to repair actions. We can represent individual repair histories using cumulative plots and connecting lines between repairs. A composite graph of the cumulative histories for all systems in the population could easily be drawn. For systems installed at different dates and consequently having different ages at the point of analysis, the referencing of all systems back to time zero would generate multicensored data. Each system repair history curve would extend only as far as the censoring time for that system.

As an example, Table 12.1 gives the repair and random censoring times for five identical systems operating at a datacenter. An event plot of the data is shown as Figure 12.7. Figure 12.8 is a combined plot of the five system histories (cumulative plots). The lines, normally displayed as staircase functions, are shown connected for easier viewing. As mentioned previously, we can envision a single curve, denoted by M(t), that gives the average or expected cumulative number of repairs per system at age t. M(t) is called the mean cumulative (repair) function (MCF). Note that M(t) can be viewed as a vertical "slice" in Figure 12.8 that averages the system histories at time t. (In contrast, a horizontal "slice" would relate to a parameter such as the mean time to the ith repair.)

How do we estimate M(t) for multicensored data? Since the cumulative population repair function M(t) is defined as the population mean cumulative number of repairs per system by time t, an unbiased estimator of M(t), for a single sample system, is just the total repair count by time t. Similarly, for a collection of systems, an unbiased estimator is M̂(t) = the average cumulative number of repairs among all systems by time t, assuming no censoring prior to time t.

TABLE 12.1
Repair Age Histories (Hours)

Repair Number      System 1   System 2   System 3   System 4   System 5
1                  222        273        125        63         91
2                  584        766        323        195        427
3                  985        1054                  325        761
4                  1161                                        1096
5                                                               1796
Censoring Times    1901       1316       442        636        2214

FIGURE 12.7 Event plot of repair histories for five systems (age in hours).

FIGURE 12.8 Repair history (cumulative plots) for five systems.

FIGURE 12.9 Repair history for two systems (age in hours; repairs and censoring times marked).

For multiple systems, at times when one or more systems have been censored, we need an estimator for M(t). Trindade and Haugh (1979) describe what they call a pooled summation estimator for M(t) applied to field renewal data. A similar estimator was developed for the general repair situation by Nelson (1988a, 1988b, 2003). First, consider the case of two systems. We want to estimate M(t) for t greater than the censoring time τ1 of system 1. The pooled summation approach uses an unbiased estimator with some desirable properties, including smaller variance than other methods under censoring. For our example, we average counts across both systems to estimate M(t) for t ≤ τ1 and then add the average (for n = 1) number of repairs on system 2 for t > τ1 to the estimate M̂(t) at the censoring time τ1. Thus, for t > τ1, M̂(t) = the average number of repairs by age τ1 on both systems plus the average number of additional repairs on system 2 occurring in the interval from τ1 to t.


To illustrate, suppose system 1 has a censoring time of τ1 = 100 hours and repairs at 50 and 90 hours. Assume system 2 has repairs at 60, 120, and 175 hours and a censoring time τ2 = 200 hours (Figure 12.9). At the repair times, t = 50, 60, 90, 120, and 175 hours, the pooled summation approach yields

M̂(50) = 1/2 = 0.5, M̂(60) = 2/2 = 1, M̂(90) = 3/2 = 1.5, M̂(120) = 3/2 + 1/1 = 2.5, and M̂(175) = 3/2 + 1/1 + 1/1 = 3.5.

To extend to more than two systems, we add to the latest estimate M̂(ti) the average cumulative number of repairs occurring since the repair time ti on only the systems uncensored at ti. Note that the pooled summation estimators are step functions at the repair times. The pooled estimator is nondecreasing in time, a desirable property not necessarily possessed by other estimators. Each time a system is censored, the pooled summation estimator is anchored at that point and subsequent repairs are summed to the previously anchored value. This methodology is described in detail by Nelson in his 1986 report, subsequent papers, and in his book (2003) on the graphical analysis of system repair data. Nelson has extended the concepts to estimate mean repair costs per system; in addition, he details the properties of the M(t) estimator.

EXAMPLE 12.1  THE MEAN CUMULATIVE FUNCTION

Using the data in Table 12.1, estimate the MCF. Plot the estimate with the original system repair histories.

Solution

We begin by displaying an event plot (ordered by systems with longest to shortest times) of the repair and censoring times from all systems (Figure 12.10). Since the earliest censoring time occurs at 442 hours, the estimate of M(t) at any system age prior to this time is just the total number of repairs, by age t, divided by 5. Thus, M̂(63) = 1/5 = 0.2, M̂(91) = 2/5 = 0.4, …, M̂(427) = 9/5 = 1.8. By the 10th repair, one system is censored, and M̂(584) = 1.8 + 1/4 = 2.05. At 636 hours, another system is censored, and at the 11th repair, with only three systems at risk, M̂(761) = 2.05 + 1/3 = 2.38. Similarly, M̂(766) = 2.05 + 2/3 = 2.72, …, M̂(1161) = 2.05 + 6/3 = 4.05. For the final repair, M̂(1796) = 4.05 + 0.5 = 4.55. The estimated MCF is plotted as M̂(t) in Figure 12.11.
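The hand calculation above is easy to automate outside a spreadsheet. The following Python sketch (our illustration, not the book's method; the data are keyed by system ID from Table 12.1) reproduces the pooled summation estimate at each repair age:

# A minimal sketch of the pooled summation MCF estimate for Table 12.1.
repairs = {1: [222, 584, 985, 1161], 2: [273, 766, 1054], 3: [125, 323],
           4: [63, 195, 325], 5: [91, 427, 761, 1096, 1796]}
censor = {1: 1901, 2: 1316, 3: 442, 4: 636, 5: 2214}

# Combine repair ("F") and censoring ("C") events and sort by age.
events = sorted([(t, "F", s) for s, ts in repairs.items() for t in ts] +
                [(t, "C", s) for s, t in censor.items()])

at_risk, mcf = len(censor), 0.0
for age, kind, sys_id in events:
    if kind == "F":
        mcf += 1.0 / at_risk          # increment by average repairs per system at risk
        print(f"age {age:5d}  M_hat = {mcf:.2f}")
    else:
        at_risk -= 1                  # a censored system leaves the risk set

The printed values match the example: 0.20 at 63 hours, 2.05 at 584 hours, and 4.55 at 1796 hours.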

FIGURE 12.10 Repair history for five systems (event plot; age in hours; repairs and censoring times marked).

FIGURE 12.11 Mean cumulative repair function M(t) versus system age (hours).

12.3.1  Solution Using Spreadsheet Methods

Figure 12.12 illustrates the estimation of the MCF using a spreadsheet. For simplicity, we use a discrete time framework, in which the events (repairs or censoring) may occur at a time index t, t = 0, 1, 2, 3, … as shown in the first column. The second column is the age of each event sorted by increasing age, and the third column is the type of event, where we have used "F" for failure or repair and "C" for a censored age. The fourth column is labeled "rt" and represents the systems at risk at the index time t of the event. Starting in row 3, the values in the rt column can be found by inserting the formula IF(C2="C",D2-1,D2) in the D3 cell and then dragging using autofill to complete the column. The fifth column is labeled mt = 1/rt and represents the incremental average number of events that occur at time t. Here, we have discrete and unique events, and consequently, mt is 1 divided by the number of systems at risk rt. Again, starting in row 3, the values in the mt column can be found by entering IF(C3="F",1/D3,0) in cell E3 and dragging down to autofill the column. Finally, the last column, labeled Mt = Mt−1 + mt, is the MCF column formed by sequentially adding the values in the mt column. Thus, we see that the MCF is

$$M(t) = M_t = \sum_{i=0}^{t} m_i$$

An MCF curve can now be created by plotting the Mt column versus the age column.

EXERCISE 12.2

Field repair history has been captured on three sample systems. The first, with 500 hours of total operating time, had two repairs at 145 and 368 hours. A second, with 300 hours of operation, had a repair at 247 hours. The third, with 700 hours of use, had three repairs at 33, 318, and 582 hours. Using the pooled summation approach, estimate M(t) at each event time.

EXERCISE 12.3

Table 12.2 shows repair histories recorded on four identical systems, each experiencing five repair events prior to reaching the respective censoring age. Create an event plot of the four systems, make a combined cumulative plot, estimate the MCF, and create an M(t) plot.

FIGURE 12.12 Spreadsheet method for estimating the MCF.

TABLE 12.2
Repair Histories for Four Machines (Hours)

Repairs    Machine A   Machine B   Machine C   Machine D
1          426         1563        450         2044
2          523         2230        1149        3151
3          2814        3489        1262        4693
4          4551        3748        1355        4935
5          5328        4122        2319        7254
Censor     6000        4500        2500        7500


12.4  Confidence Limits for the Mean Cumulative Function (Exact Age Data)

We next consider the estimation of confidence limits for the MCF. A suitable reference on the topic of confidence limits is the expository paper by Nelson (1995). Nelson and Doganaksoy (1989) presented a computer program for the calculation of confidence intervals for the MCF. We will use an approach in the paper by Lawless and Nadeau (1995). However, we begin by considering a very important point in the estimation of confidence limits for the MCF. In our examples, the repair or censoring events are identified with specific systems (i.e., system ID is provided). When we have system identification (often called SYS_ID), we can calculate true confidence limits for the MCF. We will describe the approach in this section, but first we need to discuss a very common situation in which the ages of repair and censoring events are available, but the SYS_ID is not. In such a situation, it is possible to estimate the MCF along with what Nelson (2003) calls "naive confidence limits."

We see in Figure 12.11 that we have not used information concerning the specific system ID to estimate M(t). As described by Nelson (2003), since the MCF estimate is the sum of increments mt, the variance of the MCF estimate is the sum of the variances of the increments. The variance of the increments is estimated by v(mt) = mt/rt = 1/rt². Then, assuming that the estimator of the MCF has a sampling distribution that is approximately normal based on the central limit theorem (see Chapter 5), Nelson uses a normal distribution approximation to specify the limits.

EXAMPLE 12.2  NAIVE CONFIDENCE LIMITS FOR THE MCF

Using the data in Table 12.1, estimate the pointwise 95% naive confidence limits for the MCF. Plot the MCF estimate with the confidence limits.

Solution Using Spreadsheet Methods

We have added four columns to Figure 12.12 as shown in Figure 12.13. In the G column, labeled v(mt) = mt/rt, we estimate the variance of the increments by dividing the values in column E by the respective values in column D. The formula in cell G3 is E3/D3. Column H, labeled v(Mt) = v(Mt−1) + v(mt), contains the cumulative sums of the values in column G. The formula in cell H3 is H2+G3. Finally, for a confidence level of 95%, the lower confidence limits (LCLs) are shown in column I and the upper confidence limits (UCLs) in column J. The formula in cell I3 is F3−1.96*SQRT(H3), and the formula in J3 is F3+1.96*SQRT(H3). The MCF plot and confidence limits appear in Figure 12.14.
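The same naive limits can be computed outside a spreadsheet. The following Python sketch (our illustration; the ages and at-risk counts are those of the Table 12.1 data) mirrors the column arithmetic just described:

# A sketch of the naive limits: increment variances v(m_t) = m_t/r_t = 1/r_t^2
# are summed and a normal approximation applied at each repair age.
import math

repair_ages = [63, 91, 125, 195, 222, 273, 323, 325, 427, 584,
               761, 766, 985, 1054, 1096, 1161, 1796]
at_risk =     [5, 5, 5, 5, 5, 5, 5, 5, 5, 4, 3, 3, 3, 3, 3, 3, 2]

mcf = var = 0.0
for age, r in zip(repair_ages, at_risk):
    mcf += 1.0 / r
    var += 1.0 / r ** 2
    half = 1.96 * math.sqrt(var)
    print(f"{age:5d}  MCF={mcf:.2f}  95% naive CI = ({mcf - half:.2f}, {mcf + half:.2f})")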

12.4.1  True Confidence Limits

Nelson (1995, 2003) presents the assumptions and theory for the true approximate confidence limits for the MCF. We will follow the derivation presented by Lawless and Nadeau (1995), which provides variance estimates that are always positive. These limits require that each recurrence and censoring age have system identity. The notation for the theory can be daunting, and the calculation of these limits is sufficiently complicated that special programs may be needed. We will illustrate the determination of the limits for the example data using the capabilities in JMP and Minitab, but the limits can be estimated using a spreadsheet, the description of which is provided in Appendix 12A.

EXAMPLE 12.3  CORRECT APPROXIMATE CONFIDENCE LIMITS FOR THE MCF

Using JMP capabilities for the data in Table 12.1, estimate the pointwise 95% confidence limits for the MCF.


FIGURE 12.13 Spreadsheet method for estimating the MCF and naive confidence limits.

FIGURE 12.14 MCF and 95% naive confidence limits (MCF versus system age in hours).


Solution Using JMP

The JMP data table is shown in Figure 12.15. Under Type, "1" is a repair event and "0" is a censoring event. Selecting Recurrence Analysis in the Reliability and Survival platform, we enter the appropriate column roles as shown in Figure 12.16. Note that the system ID is required. Type is entered in the Cost, End-of-Service area.

FIGURE 12.15 JMP data table for recurrence analysis.

FIGURE 12.16 JMP dialog box for recurrence analysis.


The JMP output is displayed in Figure 12.17. Note the MCF Confid Limits have been selected from the red triangle at the top of the output. Clicking the blue triangle next to MCF Table produces the table also shown in Figure 12.17.

Solution Using Minitab

The Minitab worksheet is shown in Figure 12.18. For "Type," Minitab uses the term "Retirement." As in JMP, "1" is a repair event and "0" is a censoring event. Selecting Stat, Reliability/Survival, Repairable System Analysis, Nonparametric Growth Curve …, we then enter the appropriate column roles as shown in Figure 12.19. We also click Retirement … and complete the second dialog box in Figure 12.19. Additionally, we select Storage … to choose the desired output shown in the third box. Minitab produces the event and MCF plots shown in Figure 12.20 along with a tabular output. We note that the confidence limits provided by Minitab differ from those provided by JMP. The difference is the result of Minitab using the preferred logit transformation (as described in Chapter 7) to assure positive endpoints, while JMP uses a normal approximation that can generate negative limits.

FIGURE 12.17 JMP output for recurrence analysis.


FIGURE 12.18 Minitab data worksheet for repairable system analysis.

FIGURE 12.19 Minitab dialog boxes for repairable system analysis.


Nonparametric Growth Curve: Age
System: System
Retirement value: Retirement = 0

Nonparametric Estimates
Table of Mean Cumulative Function

        Mean Cumulative   Standard       95% Normal CI
Time    Function          Error       Lower      Upper      System
63      0.20000           0.178885    0.03465    1.15443    4
91      0.40000           0.219089    0.13672    1.17026    5
125     0.60000           0.219089    0.29332    1.22734    3
195     0.80000           0.334664    0.35238    1.81624    4
222     1.00000           0.282843    0.57444    1.74083    1
273     1.20000           0.178885    0.89597    1.60720    2
323     1.40000           0.219089    1.03020    1.90254    3
325     1.60000           0.357771    1.03225    2.48002    4
427     1.80000           0.334664    1.25030    2.59138    5
584     2.05000           0.289612    1.55418    2.70400    1
761     2.38333           0.437358    1.66335    3.41497    5
766     2.71667           0.302130    2.18460    3.37832    2
985     3.05000           0.289612    2.53206    3.67388    1
1054    3.38333           0.240726    2.94294    3.88963    2
1096    3.71667           0.302130    3.16927    4.35861    5
1161    4.05000           0.289612    3.52035    4.65933    1
1796    4.55000           0.428806    3.78261    5.47307    5

FIGURE 12.20 Minitab output for analysis of five repairable systems.

12.5  Nonparametric Comparison of Two MCF Curves

We are often interested in making MCF comparisons in the behavior of repairable systems that differ based on product type, customer, location, vintage, failure causes, suppliers, applications, and so on. Consider the situation where we wish to compare MCFs for two sample sets of repairable system data, represented by M̂1(t) and M̂2(t). Assuming independent samples from the two populations, the difference between the two MCFs at time t is estimated by

$$\hat{\Delta}(t) = \hat{M}_2(t) - \hat{M}_1(t)$$


For independent samples, the variance of the difference, Var[Δ̂(t)], is estimated by

$$\text{Var}[\hat{\Delta}(t)] = \text{Var}[\hat{M}_2(t)] + \text{Var}[\hat{M}_1(t)]$$

Let the estimated standard error of the difference be $\hat{se}[\hat{\Delta}(t)] = \sqrt{\text{Var}[\hat{\Delta}(t)]}$. Then, a normal approximation for a 95% confidence interval for the difference is

$$\hat{\Delta}(t) - 1.96 \times \hat{se}[\hat{\Delta}(t)], \quad \hat{\Delta}(t) + 1.96 \times \hat{se}[\hat{\Delta}(t)]$$



A graphical approach involves plotting the difference between the two MCF curves versus time, along with the confidence limits. If the confidence limits at any point exclude the x-axis (at y = 0), then a statistically significant difference has been shown at the specified confidence level.

EXAMPLE 12.4  COMPARISON OF MCFS FOR SERVERS AT TWO DIFFERENT DATACENTERS

ABC company has two datacenters: one located on the East Coast and a newer one located on the West Coast. The East Coast datacenter has five similar file servers of a particular type that have been operating for approximately 14 months, or 10,000 hours. The West Coast has four of the same type of servers that have been in operation for approximately 7 months, or 5,000 hours. From the data observed, it appears that the reliability of the West Coast servers is worse than the reliability of the East Coast servers. The historical repair data for each location are shown in Table 12.3. Plot the MCFs for each location, plot the differences between MCFs with 95% confidence limits, and determine if the two observed MCFs show a statistically significant difference.

Solution

Figure 12.21 shows the MCFs for the East and West Coast locations and the format for estimating the differences. The standard errors have been estimated using JMP or Minitab. Figure 12.22 is a plot of the MCFs for each location. A plot of the differences is shown as Figure 12.23. The graph clearly shows that a statistically significant difference appears after approximately 1000

TABLE 12.3
Repair Histories for Servers at Two Different Datacenters

East Coast Location
Repair Number      System 1   System 2   System 3   System 4   System 5
1                  4995       5268       2993       596        5466
2                  9243       6648       5907       1084       5802
3                             7994                  8189       6940
Censoring Times    10000      10000      10000      10000      10000

West Coast Location
Repair Number      System 6   System 7   System 8   System 9
1                  123        243        995        1085
2                  241        897        2370       4984
3                  3832       4015       4459
Censoring Times    5000       5000       5000       5000


FIGURE 12.21 MCF comparison between East and West Coast locations (MCF estimates, standard errors, differences, and 95% confidence limits by event age):

Age     MCF-WC   Std Er   MCF-EC   Std Er   Difference   Std Er Diff   LCL       UCL
0       0                 0                 0
123     0.25     0.2165   0        0        0.25         0.2165        -0.1744   0.6744
241     0.5      0.4330   0        0        0.5          0.4330        -0.3487   1.3487
243     0.75     0.4146   0        0        0.75         0.4146        -0.0626   1.5626
596     0.75     0.4146   0.2      0.1789   0.55         0.4515        -0.3350   1.4350
897     1        0.5000   0.2      0.1789   0.8          0.5310        -0.2408   1.8408
995     1.25     0.4146   0.2      0.1789   1.05         0.4515        0.1650    1.9350
1084    1.25     0.4146   0.4      0.3578   0.85         0.5476        -0.2233   1.9233
1085    1.5      0.2500   0.4      0.3578   1.1          0.4365        0.2445    1.9555
2370    1.75     0.2165   0.4      0.3578   1.35         0.4182        0.5304    2.1696
2933    1.75     0.2165   0.6      0.3578   1.15         0.4182        0.3304    1.9696
3832    2        0.3536   0.6      0.3578   1.4          0.5030        0.4141    2.3859
4015    2.25     0.4146   0.6      0.3578   1.65         0.5476        0.5767    2.7233
4459    2.5      0.4330   0.6      0.3578   1.9          0.5617        0.7991    3.0009
4984    2.75     0.2165   0.6      0.3578   2.15         0.4182        1.3304    2.9696
4995    2.75     0.2165   0.8      0.3347   1.95         0.3986        1.1688    2.7312
5000    2.75     0.2165   0.8      0.3347   1.95         0.3986        1.1688    2.7312

FIGURE 12.22 MCFs for East and West Coast locations (MCF versus age in hours).

FIGURE 12.23 MCF differences between East and West Coast locations, with LCL and UCL (difference in MCFs versus age in hours).

hours, where the LCLs go above the x-axis. We see also that the comparison is possible only up to the censoring time of 5000 hours for the West Coast systems. It would not be reasonable to extend the comparison beyond that time, since doing so would require an assumption that the MCF for the West Coast systems remains constant until the censoring time of 10,000 hours for the East Coast systems.

We note that Minitab 16 has the capability in the Reliability/Survival platform, under Repairable System Analysis, Nonparametric Growth Curve …, to compare two MCFs. We need to set up a "By Variable" location column in the worksheet, check the appropriate box in the input screen, and then select the desired display outputs in Graphs … and Storage …. Similarly, JMP provides the comparison feature in the Recurrence Analysis platform by using a location column as a grouping variable. In addition to a combined plot of the two MCFs and an event plot, the red triangle has the option to plot MCF differences. The JMP output for this example is shown in Figure 12.24.
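The difference calculation itself is simple enough to verify with a short script. The following Python sketch (not from the book) reproduces one row of Figure 12.21, the comparison at 2370 hours, using the MCF estimates and standard errors reported there:

# A sketch: combine two independent MCF estimates into a 95% interval
# for their difference at a common age (values from Figure 12.21, age 2370).
import math

mcf_wc, se_wc = 1.75, 0.2165     # West Coast estimate and standard error
mcf_ec, se_ec = 0.40, 0.3578     # East Coast estimate and standard error

diff = mcf_wc - mcf_ec
se_diff = math.sqrt(se_wc ** 2 + se_ec ** 2)
lcl, ucl = diff - 1.96 * se_diff, diff + 1.96 * se_diff
print(round(diff, 2), round(se_diff, 4), round(lcl, 4), round(ucl, 4))
# roughly 1.35, 0.4182, 0.5304, 2.1696; the interval excludes zero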

EXERCISE 12.4 Table 12.4 shows the repair histories recorded on seven identical systems at two different locations. Plot the differences between MCFs with 95% confidence limits and determine if the two observed MCFs show a statistically significant difference.

In addition to graphical tools, analytical approaches are very important for developing, testing, and applying models. When the amount of data is small, or when we want to establish formally the significance of an effect, models are especially appropriate. We will now discuss the properties, models, and analytical methods for renewal processes.


FIGURE 12.24 JMP plot of MCF difference between East and West Coast locations.

TABLE 12.4
Repair Histories for Two Locations

Location A
Repair Number      System 1   System 2   System 3
1                  461        272        503
2                  1563       3457       3650
3                  2054       3584
Censoring Times    5000       5000       5000

Location B
Repair Number      System 4   System 5   System 6   System 7
1                  608        1267       3653       2677
2                  2268       6434       4235       3143
3                  3790                             8690
Censoring Times    10000      10000      10000      10000


12.6  Renewal Processes Let us briefly review and summarize the important considerations for a renewal process for repairable systems. For a single-component system, a renewal process would be expected when the replacements are always new items from the same population as the items being replaced. Consequently, the times between consecutive replacements are assumed i.i.d. Note that we shall use the terms components, units, items, and parts interchangeably, and that a single component may be considered a system. Multicomponent systems are treated later in this section. Under the renewal scenario, certain theoretical simplifications are possible, such that data analysis methods based on nonrepairable component theory are applicable. For example, times between failures may be utilized for probability plotting and graphical parameter estimation. However, without the renewal process assumption, for proper analysis, we need to use methods that take the time sequence of observations into consideration. Otherwise, misleading conclusions are possible from the use of inappropriate techniques. A system containing a single repairable component is a simple but frequently useful model. For example, a radar system may have a weak part that accounts for practically all failures during a specified period of use. The system operates until a failure occurs. The system is restored to operation (repaired) by replacement of the failed component. We assume the repair time is negligible compared to the operating time. If the replacement parts are new parts from the same population as the original parts and the replaced parts fail independently with a common failure distribution, we have a renewal process. Generally for a repair process, two random variables are of key interest: N(t), the total number of repairs by specified time t, and the total time T(k) (system age) to reach the kth repair. N(t) is a discrete random variable, but T(k) is a continuous random variable. Both variables depend on the joint distributions of times between repairs. N(t) is a random variable at any time t. The mean or expected value of N(t) is called the renewal function, denoted by M(t). (This value is the same as the mean cumulative number of repairs function per system discussed in Section 12.3.) Thus, M(t) = E[ N (t)] . Similar to the discussion for the mean repair function, when we observe the cumulative number of replacements versus time for a renewal process for a single system, we are seeing a set of data forming one sample (a realization) from the many possible for the random variable N(t). Imagine a population of curves in which the N i (t) for each system i is plotted versus time t. Then, the mean of all possible N i (t) curves is the renewal function M(t). For a renewal process, a single distribution of failure times defines the expected pattern (presuming the same distribution applies to all systems is an explicit assumption that needs to be verified). Let X i denote the operating time between repairs, that is, X i is the interarrival time between the (i − 1) and the ith repairs, where X 0 = 0. Equivalently, X i = T (i) − T (i − 1), where T(0) = 0. We can express N(t), the number of repairs by time t, as a function of the X i. Thus, N(t) is the maximum integer k such that the sum of the first k interarrival times is less than or equal to t, that is, N(t) = k implies that



$$\sum_{i=1}^{k} X_i \le t \quad \text{and} \quad \sum_{i=1}^{k+1} X_i > t.$$

For example, suppose we have a system with repairs at ages 10, 15, 27, 49, and 62 hours. The times between repairs are then 10, 5, 12, 22, and 13 hours, respectively. The observed number of repairs at t = 30 hours is N(30) = 3, since the sum of the first three interarrival times, 10 + 5 + 12 = 27 hours, is less than t = 30 hours, and the sum of the first four interarrival times, 10 + 5 + 12 + 22 = 49 hours, is greater than t = 30 hours.


We see that N(t) = 0 if the first repair time X1 > t. Similarly, the time to the kth replacement can be written in terms of the Xi as

$$T(k) = \sum_{i=1}^{k} X_i$$

If we know the probability distribution of X i, we can (at least theoretically) determine the distributions for N(t) and T ( k ), along with the renewal function M(t) and the corresponding population rate of renewal (i.e., RR), dM(t)/dt.

12.7  Homogeneous Poisson Process

Suppose the interarrival times Xi are independent and exponentially distributed with failure rate λ, that is, the times have a PDF

$$f(x) = \lambda e^{-\lambda x}, \quad 0 \le x$$

Then, the probability distributions of N(t) and T(k) are easy to determine. Barlow and Proschan (1975) show that N(t) has a Poisson distribution with mean rate (or intensity) λ. So, the probability of observing exactly N(t) = k replacements in the interval (0,t) is

$$P[N(t) = k] = \frac{(\lambda t)^k e^{-\lambda t}}{k!}$$

A renewal process for which the interarrival distribution is exponential is called a homogeneous Poisson process (or simply a Poisson process) and denoted by HPP. Based on the Poisson distribution, the expected value for N(t), that is, M(t), is λt, and the variance is also λt. Similarly, based on the exponential distribution, the mean or expected value for the interarrival times is 1/λ and the variance is 1/λ².

Note that this probability statement is consistent with the interpretation of the CDF as the probability of failure in the interval (0,t). The probability of at least one failure is 1 minus the probability of no failures. For the Poisson process, the probability of no failures in the interval (0,t) is P[N(t) = 0] = e^(−λt). The probability of at least one failure is 1 − e^(−λt), which is the same expression as the CDF for an exponential distribution with failure rate λ. We see that a Poisson process represents both a renewal process with independent exponentially distributed interarrival times having mean 1/λ and a counting process having the integer-valued Poisson distribution with mean rate λ.

EXAMPLE 12.5  HPP PROBABILITY ESTIMATES

Assume an HPP with rate λ = 0.0003 repairs per hour. If the system runs for t = 500 hours, what is the probability of at least one failure? What is the expected number of failures?

Solution The expected number of failures in 500 hours is λt = 0.0003 × 500 = 0.15. The probability of zero failures is e −0.15 = 0.861. The probability of at least one failure is thus 1 − 0.861 = 0.139.
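A quick way to check such HPP calculations is shown below. This is only a sketch; it assumes the scipy library rather than the spreadsheet functions used in this book:

# Verify Example 12.5 with the Poisson distribution.
from scipy.stats import poisson

lam, t = 0.0003, 500             # repairs per hour, mission hours
mean = lam * t                   # expected number of failures = 0.15
print(mean)                      # 0.15
print(poisson.pmf(0, mean))      # P[N(t) = 0], about 0.861
print(1 - poisson.pmf(0, mean))  # P[at least one failure], about 0.139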


The MTBF for an HPP is constant and equals the mean time to failure (MTTF) for the exponential distribution of interarrival times. Thus,



$$\text{MTBF} = \theta = 1/\lambda = \text{MTTF}$$

and we can write the Poisson distribution for an HPP in terms of the MTBF as

$$P[N(t) = k] = \frac{(t/\theta)^k e^{-t/\theta}}{k!}$$

The expected number of repairs in time t is t/θ.

EXAMPLE 12.6  HPP ESTIMATES IN TERMS OF THE MTBF

Assume an HPP with MTBF = 10,000 hours. What is the probability of one failure in 3 months?

Solution The expected number of failures in 3 months is 91 × 24/10,000 = 0.218. The probability of exactly one failure is P[N(t ) = 1] = 0.218e −0.218 = 0.1756. EXERCISE 12.5 The MTBF for a single system is 10 hours. Assume an HPP. What is the expected number of repairs in a 24-hour period? What is the probability of observing exactly one repair in the same period? What is the probability of no failures in 24 hours? What is the probability of at least one failure?

By multiplying the HPP distribution probabilities—based on historical values of the MTBF—by the number of systems, we can estimate the expected distribution of failures across any number of similar HPP systems over a future time period. An example of such an application involving nearly 500 servers in a datacenter is described by Trindade (2009). EXERCISE 12.6 The historical MTBF for a group of servers over the previous few months is 10,000 hours. There are 100 servers in use. Over the next 60 days, what are the expected number of systems with no repairs, one repair, two repairs, and more than two repairs?

12.7.1  Distribution of Repair Times for HPP

For the HPP, we can write the time T(k) to the kth replacement as the sum of k independent exponential random variables. This process is equivalent to that of a k-level standby system introduced in Chapter 8. Consequently, the PDF for T(k) is the gamma distribution with parameters λ and k. That is,

$$f_k(t) = \frac{\lambda^k t^{k-1} e^{-\lambda t}}{(k-1)!}$$

Since T(k) has a gamma distribution, the mean time to the kth repair is k/λ = kθ, and its variance is k/λ² = kθ², where the MTBF θ = 1/λ. Note that for k = 1, the gamma distribution becomes the exponential distribution with PDF f(t) = λe^(−λt).


Because of the equivalence of the two representations of a Poisson process, the CDF for T(k) at time t can be viewed as the probability that k or more replacements occur by time t, which is the same as 1 minus the probability that k − 1 or less replacements occur by t. Thus, we can express the gamma CDF using a sum of Poisson probabilities, that is,

$$F_k(t) = 1 - \sum_{i=0}^{k-1} \frac{(\lambda t)^i e^{-\lambda t}}{i!} = 1 - e^{-\lambda t} \sum_{i=0}^{k-1} \frac{(\lambda t)^i}{i!}$$

This information is important for deciding how many spare parts to stock to ensure a system operates during a mission period t. Equivalently, this CDF expression states the probability that the time T(k) for the kth replacement is less than or equal to t.

EXAMPLE 12.7  TIME TO KTH REPAIR FOR HPP PROCESS

The interarrival times are described by an HPP with MTBF = 1000 hours. Create a graph that shows the CDF for the time to the kth repair, where k = 1, 2, 3, 4, and 5. Estimate the median times for k = 1, 2, 3, 4, and 5 repairs. Estimate the probability of k = 1, 2, 3, 4, and 5 repairs at 2000 hours.

Solution To create the graph, we will use the Excel function GAMMADIST(time, k, MTBF, 1). We set up a spreadsheet as partially shown in Figure 12.25. The graph is shown as Figure 12.26. Iterating on the Excel formula, we see that the approximate median times estimates for k = 1, 2, 3, 4, and 5 repairs are 693, 1678, 2674, 3672, and 4671 hours, respectively. The probabilities of k = 1, 2, 3, 4, and 5 repairs at 2000 hours are 0.8647, 0.5940, 0.3233, 0.1429, and 0.0527, respectively. EXERCISE 12.7 Assuming an HPP with MTBF of 24 hours, find the probability that the first failure occurs by 10 hours. Find the probability that two failures occur by 24 hours.
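The same results can be obtained outside Excel. The following Python sketch (assuming scipy, which is not one of the book's tools) evaluates the gamma CDF and medians of Example 12.7:

# Time to the kth repair under an HPP is gamma with shape k and scale MTBF.
from scipy.stats import gamma

MTBF = 1000
for k in range(1, 6):
    p_2000 = gamma.cdf(2000, a=k, scale=MTBF)     # P[kth repair occurs by 2000 hours]
    median = gamma.ppf(0.5, a=k, scale=MTBF)      # median time to the kth repair
    print(k, round(p_2000, 4), round(median))
# approximately 0.8647/693, 0.5940/1678, 0.3233/2674, 0.1429/3672, 0.0527/4671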

FIGURE 12.25 Partial spreadsheet for time to kth repair.


FIGURE 12.26 Gamma distribution CDF for time to kth repair (k = 1, …, 5), MTBF = 1000 hours (cumulative probability versus time in hours).

EXAMPLE 12.8  SPARE PARTS FOR AN HPP

Components are assumed to fail with an exponential distribution having failure rate λ = 0.00030 = 0.030%/hour. The mission duration is 500 hours. To assure a 95% probability of successful mission completion, how many spare parts should be carried for a single-component system?

Solution The expected number of failures in 500 hours is λt = 0.0003 × 500 = 0.15. The probability of zero failures is e −0.15 = 0.861. The probability of exactly one failure is 0.15e −0.15 = 0.129. Thus, the probability of two or more failures is 1 − 0.861 − 0.129 = 0.010 or 1.0%. Carrying one spare part will assure system operation during mission with nearly 99% probability. Note that two parts in total are required: the first is the original part, and the second is the spare. A spreadsheet can be easily set up to provide the number of spare parts needed to achieve a desired probability of mission success over a specified time period for an HPP. Two equivalent approaches are possible: one based on the Poisson distribution and one based on the gamma distribution. For Poisson probabilities, one would enter POISSON(#spares, mean, 1) into a cell, where the mean = λt. For probabilities based on the gamma distribution, the entry would be 1− GAMMADIST(time, #spares + 1, 1/λ, 1). The number of spares (#spares) is then varied to reach the target mission success probability. For example, let us assume the failure rate λ = 0.0025 = 0.25%/hour and the mission length is 1000 hours. How many spares are required for a 99% probability of mission success? Figure 12.27 shows a spreadsheet with input values and a table of possible number of spares in column A evaluated by the Poisson formula based on values in rows 1, 2, and 3. Specifically, the formula in cell B6 is =POISSON($A6,$B$3,1), and the remaining rows in column B are extended using the fill handles in the spreadsheet. We see that seven spares are required to achieve a mission success probability of at least 99%. EXERCISE 12.8 A manufacturer uses a part that historically has failed in a system with an exponential distribution with failure rate λ = 0.000065 = 0.0065%/hour. The manufacturer wants to


FIGURE 12.27 Spreadsheet example for spare parts determination.

be sure that an adequate number of spare parts are available over a 1-year period (8766 hours). To assure a 90% probability of spares being available, how many spare parts should be stockpiled for each system?
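Spare parts questions of this type, like the search in Example 12.8, amount to stepping the spare count up until the Poisson CDF reaches the target probability. A Python sketch (assuming scipy; this is an illustration, not the book's spreadsheet approach):

# Find the smallest spare count n with P[no more than n failures] >= target.
from scipy.stats import poisson

def spares_needed(fail_rate, hours, target):
    mean = fail_rate * hours        # expected failures over the period
    n = 0
    while poisson.cdf(n, mean) < target:
        n += 1
    return n

print(spares_needed(0.0025, 1000, 0.99))   # 7, matching the spreadsheet of Figure 12.27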

In summary, the number of repairs N(t) occurring by time t is an example of a stochastic counting process. A counting process possesses independent increments if the number of events in nonoverlapping intervals of time are independent. A counting process is said to have stationary increments if the distribution of the number of events that occur in an interval of time depends only on the length of the interval and not on the endpoints of the interval. (A renewal process is an example of a stationary process.) A process with stationary and independent increments is equivalent to a process that restarts itself, in a probabilistic sense, at any point in time (i.e., it has no memory). The Poisson process has stationary independent increments since the number of repairs depends only on the elapsed time and not on the start time. For the Poisson process, the interarrival times are independent exponentially distributed random variables. For further discussion of counting processes, see Ross (2003). Other models and methods for the statistical analysis of recurrent events are presented in Cook and Lawless (2007). EXAMPLE 12.9  MEMORYLESS PROPERTY OF THE POISSON PROCESS Consider a Poisson process in which renewals occur at the rate of λ = 0.5 per month. Find the expected system age at the fifth renewal. What is the probability that the time difference between the fifth and sixth renewals is greater than 1 month?

Solution

The expected waiting time for the fifth failure is given by the gamma distribution, with mean time to the fifth repair given by E[t5] = k/λ = 5/0.5 = 10 months. Since we have independent and stationary increments for the Poisson process, the probability that the elapsed time between the fifth and sixth renewal exceeds 1 month is the same as the probability that the time between any consecutive renewals exceeds 1 month. Using the time to the first renewal, we get

$$P[X_6 > 1] = P[X_1 > 1] = e^{-\lambda(1)} = e^{-0.5} = 0.607$$


EXERCISE 12.9 Suppose a component failure distribution is exponential with mean rate λ = 0.3%/K. For a renewal process involving a single-component system, what is the probability of no replacements in 4000 hours of system operation? EXERCISE 12.10 A component failure distribution is exponential with mean rate λ = 6.0%/K. For a renewal process involving a single-component system, how many spare parts should be provided for 10,000 hours of system operation to assure that the probability of a replacement shortage is less than 5%? EXERCISE 12.11 When the soil in an agricultural study reaches a specified level of dryness, an automatic misting follows that lasts for 2 minutes. Assume that the number of mistings per day is a Poisson process with rate of 0.2 per hour. What is the expected number of mistings between midnight and noon? What is the expected waiting time to the third misting? Given that three mistings have occurred, what is the probability that the time to the next misting is greater than 2 hours?

12.8  MTBF and MTTF for a Renewal Process

Consider a renewal process truncated at either a specified time or the last repair. Ignoring the time to make repairs, the MTBF can be estimated by dividing the system age T by the number of failures k, that is,

$$\widehat{\text{MTBF}} = \frac{T}{k}$$



Also, for a renewal process, we can treat the interarrival times as if they arise from an independent sample of nonrepairable parts from a single distribution, as discussed in Example 12.1. Thus, the MTTF for the interarrival times is estimated by the average of the times between failures, including any accumulated time for censored observations. Since the sum of the interarrival times and accumulated time since the last repair must equal the system age, we obtain the same result for both the MTTF and the MTBF calculations. Thus, for a renewal process, the MTBF and the MTTF are equivalent. Note it is not meaningful to talk about a single MTBF for a nonrenewal process since the rate of repairs is changing. If we have an HPP in which the interarrival times are independent and exponentially distributed, then—similar to the discussion of confidence intervals for failure rates in Chapter 3—we can easily determine confidence intervals for the MTBF. In a renewal process, we are replacing the failed units. If the renewal process is stopped at a fixed number of failures (failure-censored data), the one-sided LCL on the MTBF is given by

$$\left[\frac{2k}{\chi^2_{2k;100\alpha}}\right] \widehat{\text{MTBF}}$$


where $\chi^2_{2k;100\alpha}$ is the 100α percentile of the chi-squared distribution with 2k degrees of freedom. The spreadsheet expression for this limit is 2*k*MTBF/CHIINV(α,2*k). More commonly, the renewal process is terminated at a fixed time (time-censored data), for which the one-sided LCL on the MTBF is given by

$$\left[\frac{2k}{\chi^2_{2(k+1);100\alpha}}\right] \widehat{\text{MTBF}}$$

The spreadsheet form is 2*k*MTBF/CHIINV(α,2*k+2). For either failure-censored or time-censored data, the one-sided UCL on the MTBF is given by

$$\left[\frac{2k}{\chi^2_{2k;100(1-\alpha)}}\right] \widehat{\text{MTBF}}$$



The spreadsheet expression for this limit is 2*k*MTBF/CHIINV(1-α,2*k). We can view the bracketed expression in the last three equations as a factor multiplied by the estimated MTBF to obtain the desired limits. Tables 12.5 (failure censoring) and 12.6 (time censoring) provide the one-sided lower bound factors for the MTBF at various confidence levels. Table 12.7 (for either failure or time censoring) gives the one-sided upper bound factors. To determine the specific factor, find, in the appropriate table, the intersection of the number of fails and the desired confidence level. Note that Table 12.6 also gives factors for zero failures: multiply the system age by the factor to obtain an MTBF lower bound.

TABLE 12.5
One-Sided Lower Confidence Bound Factors for the MTBF (Failure-Censored Data)

                       Confidence Level
Number of Failures   60%     70%     80%     85%     90%     95%     97.5%
1                    1.091   0.831   0.621   0.527   0.434   0.334   0.271
2                    0.989   0.820   0.668   0.593   0.514   0.422   0.359
3                    0.966   0.830   0.701   0.635   0.564   0.477   0.415
4                    0.958   0.840   0.725   0.665   0.599   0.516   0.456
6                    0.954   0.856   0.759   0.706   0.647   0.571   0.514
7                    0.953   0.863   0.771   0.721   0.665   0.591   0.536
8                    0.954   0.869   0.782   0.734   0.680   0.608   0.555
9                    0.954   0.874   0.791   0.745   0.693   0.623   0.571
10                   0.955   0.878   0.799   0.755   0.704   0.637   0.585
12                   0.956   0.886   0.812   0.771   0.723   0.659   0.610
15                   0.958   0.895   0.828   0.790   0.745   0.685   0.639
20                   0.961   0.906   0.846   0.812   0.772   0.717   0.674
30                   0.966   0.920   0.870   0.841   0.806   0.759   0.720
50                   0.971   0.935   0.896   0.872   0.844   0.804   0.772
100                  0.978   0.952   0.923   0.906   0.885   0.855   0.830
250                  0.985   0.969   0.950   0.938   0.924   0.904   0.887

Note: Use for failure-censored data to multiply the MTTF or MTBF estimate to obtain a lower bound at the given confidence level.


TABLE 12.6
One-Sided Lower Confidence Bound Factors for the MTBF (Time-Censored Data)

                       Confidence Level
Number of Failures   60%     70%     80%     85%     90%     95%     97.5%
0                    1.091   0.831   0.621   0.527   0.434   0.334   0.271
1                    0.494   0.410   0.334   0.297   0.257   0.211   0.179
2                    0.644   0.553   0.467   0.423   0.376   0.318   0.277
3                    0.719   0.630   0.544   0.499   0.449   0.387   0.342
4                    0.764   0.679   0.595   0.550   0.500   0.437   0.391
5                    0.795   0.714   0.632   0.589   0.539   0.476   0.429
6                    0.817   0.740   0.661   0.618   0.570   0.507   0.459
7                    0.834   0.760   0.684   0.642   0.595   0.532   0.485
8                    0.848   0.777   0.703   0.662   0.616   0.554   0.508
9                    0.859   0.790   0.719   0.679   0.634   0.573   0.527
10                   0.868   0.802   0.733   0.694   0.649   0.590   0.544
12                   0.883   0.821   0.755   0.718   0.675   0.617   0.572
15                   0.899   0.841   0.780   0.745   0.704   0.649   0.606
20                   0.916   0.864   0.809   0.777   0.740   0.688   0.647
30                   0.935   0.891   0.844   0.816   0.783   0.737   0.700
50                   0.953   0.918   0.879   0.856   0.829   0.790   0.759
100                  0.969   0.943   0.915   0.897   0.877   0.847   0.822
250                  0.982   0.965   0.946   0.935   0.921   0.901   0.883

Note: Use for time-censored data to multiply the MTTF or MTBF estimate to obtain a lower bound at the given confidence level. For zero failures, multiply the system operating hours by the factor corresponding to the desired confidence level.

TABLE 12.7
One-Sided Upper Confidence Bound Factors for the MTBF (Failure- or Time-Censored Data)

                       Confidence Level
Number of Failures   60%     70%     80%     85%     90%      95%      97.5%
1                    1.958   2.804   4.481   6.153   9.491    19.496   39.498
2                    1.453   1.823   2.426   2.927   3.761    5.628    8.257
3                    1.313   1.568   1.954   2.255   2.722    3.669    4.849
4                    1.246   1.447   1.742   1.962   2.293    2.928    3.670
5                    1.205   1.376   1.618   1.795   2.055    2.538    3.080
6                    1.179   1.328   1.537   1.687   1.904    2.296    2.725
7                    1.159   1.294   1.479   1.610   1.797    2.131    2.487
8                    1.144   1.267   1.435   1.552   1.718    2.010    2.316
9                    1.133   1.247   1.400   1.507   1.657    1.917    2.187
10                   1.123   1.230   1.372   1.470   1.607    1.843    2.085
12                   1.108   1.203   1.329   1.414   1.533    1.733    1.935
15                   1.093   1.176   1.284   1.357   1.456    1.622    1.787
20                   1.077   1.147   1.237   1.296   1.377    1.509    1.637
30                   1.060   1.115   1.185   1.231   1.291    1.389    1.482
50                   1.044   1.085   1.137   1.170   1.214    1.283    1.347
100                  1.029   1.058   1.093   1.115   1.144    1.189    1.229
250                  1.018   1.035   1.057   1.070   1.087    1.113    1.137

Note: Use for failure-censored or time-censored data to multiply the MTTF or MTBF estimate to obtain an upper bound at the given confidence level.


EXAMPLE 12.10  CONFIDENCE BOUNDS ON THE POPULATION MTBF FOR AN HPP

The following ages (in hours) at repair are recorded for a system observed for 600 hours: 90, 133, 200, 294, 341, 360, 468, and 562. Assuming an HPP, estimate the MTBF and a 90% confidence interval on the true MTBF.

Solution The MTBF estimate is 600/8 = 75 hours. From Table 12.6, the single-sided, 95% confidence level, lower factor for eight fails is 0.554. From Table 12.7, the upper 95% factor is 2.010. The 90% confidence interval on the true MTBF is, thus, 41.55–150.75 hours. Alternatively, the two-sided confidence limits for the MTBF can be obtained by spreadsheet functions. The LCL is 2*8*75/CHIINV(0.05,2*8+2) = 41.57 for time-censored data. The UCL is 2*8*75/CHIINV(0.95,2*8) = 150.72.
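The chi-squared limits of Example 12.10 can also be computed directly. The following Python sketch assumes scipy; note that CHIINV(α, df) corresponds to chi2.ppf(1 − α, df):

# Two-sided 90% limits on the MTBF for time-censored data, k = 8 repairs in 600 hours.
from scipy.stats import chi2

T, k, alpha = 600.0, 8, 0.05
mtbf_hat = T / k                                            # 75 hours
lcl = 2 * k * mtbf_hat / chi2.ppf(1 - alpha, 2 * k + 2)     # lower limit, time censored
ucl = 2 * k * mtbf_hat / chi2.ppf(alpha, 2 * k)             # upper limit
print(mtbf_hat, round(lcl, 1), round(ucl, 1))               # 75.0 41.6 150.7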

Table 12.8 is an additional table useful for determining the length of test required for an HPP in order to demonstrate a desired MTBF at a given confidence level if k failures occur. We show the use of this table in Example 12.11. Alternatively, the spreadsheet function −LN(α) for zero failures or CHIINV(α,2*k+2)/2 for k equal to one or more failures provides the test length factor.

EXAMPLE 12.11  TEST LENGTH GUIDE FOR AN HPP

We want to demonstrate an MTBF of at least 168 hours at a 90% confidence level. How many hours should we run the test if we allow up to three failures?

Solution

From Table 12.8, we see that 6.68 × 168 hours = 1122.24 hours are required with at most three failures to demonstrate the desired MTBF. Obviously, the test is time censored. Alternatively, the test length factor is obtainable from the spreadsheet expression CHIINV(0.10,8)/2, which returns the value 6.68.

TABLE 12.8
Test Length Guide

                       Confidence Level
Number of Failures   50%      60%      75%      80%      90%      95%      97.5%
0                    0.693    0.916    1.39     1.61     2.30     3.00     3.69
1                    1.68     2.02     2.69     2.99     3.89     4.74     5.57
2                    2.67     3.11     3.92     4.28     5.32     6.30     7.22
3                    3.67     4.18     5.11     5.52     6.68     7.75     8.77
4                    4.67     5.24     6.27     6.72     7.99     9.15     10.24
5                    5.67     6.29     7.42     7.91     9.27     10.51    11.67
6                    6.67     7.34     8.56     9.08     10.53    11.84    13.06
7                    7.67     8.39     9.68     10.23    11.77    13.15    14.42
8                    8.67     9.43     10.80    11.38    12.99    14.43    15.76
9                    9.67     10.48    11.91    12.52    14.21    15.71    17.08
10                   10.67    11.52    13.02    13.65    15.41    16.96    18.39
15                   15.67    16.69    18.49    19.23    21.29    23.10    24.74
20                   20.67    21.84    23.88    24.73    27.05    29.06    30.89
30                   30.67    32.09    34.55    35.56    38.32    40.69    42.83
50                   50.67    52.49    55.62    56.89    60.34    63.29    65.92
100                  100.67   103.23   107.58   109.35   114.07   118.08   121.63

Note: Multiply the desired MTBF by the factor to determine the test time needed to demonstrate the desired MTBF at a given confidence level, if k failures occur.


EXERCISE 12.12 The following ages (in hours) at repair are recorded for a system truncated at the last failure: 3, 23, 32, 59, 143, and 162. Estimate the MTBF assuming an HPP. Provide a 95% confidence interval for the true MTBF. EXERCISE 12.13 A customer requires that the system our company manufactures demonstrate an MTBF of at least 500 hours. How many hours do we have to run the test for confirmation at the 95% confidence level, assuming an HPP for repairs? Allow at most one failure. EXERCISE 12.14 Consider the following cumulative replacement times for a single-component system: 40, 150, 400, 430, 500, 670, 750, and 800 hours. The interarrival times X i , (i = 1, … , 8), are 40, 110, 250, 30, 70, 170, 80, and 50 hours. Assume a Poisson process, failure-censored observations, and estimate the failure rate λ. Give 90% upper and lower limits on the failure rate. (Hint: Use tables for MTBF and note failure rate = 1/MTBF.)

12.9  MTTF and MTBF Two-Sample Comparisons

In Chapter 8, we discussed the likelihood ratio test (LRT) for testing the hypothesis that two samples come from the same distribution. A fairly common problem in reliability analysis for both repairable and nonrepairable systems involves testing the equality of the means of two exponential distributions. For nonrepairable systems, the mean is the MTTF, which is the reciprocal of the constant hazard rate λ. For an HPP in a repairable system, the mean is the MTBF, which is the reciprocal of the constant RR λ. The mean in either case is estimated by dividing the total time on test by the total number of failures or repairs for each group. Thus, the LRT can be similarly applied for both count-censored and time-censored observations. We now show how the LRT can be used to test the equality of two exponential means. An alternative procedure by Cox (1953), which consists of calculating a confidence interval on the ratio of the two estimated means, is described in Appendix 12A.

Assume we have two exponential distributions of count-censored data. For a nonrepairable system, the total time on test Ti for each group i = 1, 2 is

$$T_i = \sum_{j=1}^{r_i} t_{i(j)} + (n_i - r_i)\, t_{i(r_i)}$$

where ri is the total number of failures occurring among ni units at times ti( j ), j = 1, 2 , ..., ri . For time-censored data, ti( ri ) in the formula is replaced by the censoring time τ i. The MTTF estimate is θˆ i = Ti ri. Assuming an HPP for repairable systems, the total time on test will be the total age of each system, that is, either the time of the last repair for failure-truncated data or the current age of the system for time-truncated data. For a repairable system with ri repairs by time Ti, the MTBF estimate is θˆ i = Ti ri.

451

Repairable Systems Part I

EXAMPLE 12.12  LIKELIHOOD RATIO TEST FOR COMPARISON OF TWO EXPONENTIAL MTTFS (NONREPAIRABLE COMPONENTS) It is desired to investigate the effects of altitude on radiation-induced soft errors in memory devices. Two groups of memory devices, each consisting of 100 units in a dynamic stress test with continuous monitoring, are exposed to radiation for 1000 hours. The radiation level for group 2 is 10 times the level for group 1. The radiation levels are designed to simulate conditions at ground level compared to a high-altitude environment. The failure times recorded for the two groups are shown in Table 12.9. At a 95% confidence level, and assuming an exponential distribution of failure times, is there evidence that the higher radiation affects the failure probability, that is, are the means different?

Solution We begin by estimating the MTTF for each group separately: r1

θˆ 1 =



∑t

+ (n1 − r1)τ1

i (1)

i =1

=

r1

96871 = 19374.2 5

r2

θˆ 2 =

∑t

i ( 2)

+ (n2 − r2 )τ 2

i =1

=

r2

91703 = 6550.2 14

The likelihood function for the two groups combined after some algebra is



LIK(θˆ 1, θˆ 2 ) =

1 θˆ 1

r1

1 θˆ 2

r2

exp − ( r1 + r2 ) = 7.668E − 84

Next, we estimate the common MTTF under the null hypothesis of equal means: r1



θˆ =



r2

ti (1) + (n1 − r1)τ +

i =1

∑t

i ( 2)

+ (n2 − r2 )τ

i =1

=

r1 + r2

188574 = 9924.9 19

TABLE 12.9 Failure Times in Hours Low Altitude 29 37 42 780 983 – – – – – – – – –

High Altitude 95 133 142 155 156 186 240 275 450 517 664 806 940 944

452

Applied Reliability

The likelihood function for the two groups combined under the null hypothesis after some algebra is r1+ r2

1 θˆ

LIK(θˆ , θˆ ) =



exp [ −(r1 + r2 )] = 6.465E − 85

The LRT statistic is



( )

LIK θˆ , θˆ

LRT = −2ln

(

LIK θˆ 1, θˆ 2

)

= 4.947

Under the null hypothesis, the LRT has an approximate chi-square distribution with 1 degree of freedom for samples of at least 25 (i.e., (n1 + n2 ) ≥ 25). For a 95% confidence level, the spreadsheet function CHIINV(0.05,1) gives a percentile value of 3.84. Since LRT > 3.84, the data provide evidence to reject equality of means. Alternatively, to obtain the significance level, the spreadsheet function CHIDIST(4.947,1) results in a value of 0.026. EXAMPLE 12.13  LIKELIHOOD RATIO TEST FOR COMPARISON OF TWO HPP MTBFS (REPAIRABLE SYSTEMS) Similar to the previous example, engineers have obtained reliability data on radiation-induced soft errors in memory devices on two systems at different altitudes, one at ground level and another at a high-altitude observatory. Each system consists of 200 powered memory devices. Upon failure, a system is restored to operation by immediate rebooting since no physical damage is associated with the soft errors. Each system is observed for 180 days, or 4320 hours. System 1 at ground level experiences 5 failures, and system 2 at high altitude records 11 failures. At a 95% confidence level, and assuming an HPP of failure times, is there evidence that the system MTBFs are different?

Solution Note that the MTBF estimates apply to the system and not to the components internal to the system. When comparing systems, it is important to compare identical configurations. The estimated MTBF for each system separately is 4320 θˆ 1 = = 864 5



4320 θˆ 2 = = 393 11 The likelihood function for the two systems combined is

LIK(θˆ 1, θˆ 2 ) =

1 θˆ

r1

1

1 θˆ

r2

exp − ( r1 + r2 ) = 6.819E − 51

2

Next, we estimate the common MTBF under the null hypothesis of equal means:

2(4320) θˆ = = 540 5 + 11

The likelihood function for the two groups combined under the null hypothesis is

⎛ 1⎞ LIK(θˆ , θˆ ) = ⎜ ⎟ ⎝ θˆ ⎠

r1+ r2

exp [ −(r1 + r2 )] = 2.153E − 51

453

Repairable Systems Part I

The LRT is



LRT = −2ln

( )

LIK θˆ , θˆ

(

LIK θˆ 1, θˆ 2

)

= 2.306

Under the null hypothesis, the LRT has an approximate chi-square distribution with 1 degree of freedom. For a 95% confidence level, the spreadsheet function CHIINV(0.05,1) gives a percentile value of 3.84. Since LRT < 3.84, there is insufficient evidence to reject equality of MTBFs. Alternatively, the significance level from the spreadsheet function CHIDIST(2.306,1) is 0.129. EXERCISE 12.15 Two groups of 50 components (100 total) from different vendors are stressed for 1000 hours. Assume exponentially distributed failure times. Failures are observed on group 1 at times 95, 535, 878, and 926 hours, and on group 2 at times 22, 88, 213, 277, 282, 310, 371, 574, and 864 hours. Use the LRT to determine if the MTTFs from the different vendors are equal. EXERCISE 12.16 Two repairable systems are observed for 1000 hours each. Assume an HPP for each system. The first system has 11 repairs and an estimated MTBF of 91 hours. The second system has three repairs and an estimated MTBF of 333 hours. Is the evidence sufficient to conclude at a 95% confidence level that the two systems are different?

12.10  Availability Up to this point, we have assumed that repair times are negligible compared to the system ages. However, when a system is repaired, the system is not operational and the downtime can be a major issue for businesses that run 24 hours per day and 7 days a week. Hence, we introduce the concept of availability, which takes the downtime into consideration. The model we use for availability is that of an alternating renewal process. A failure is followed by a repair, and the two times involved constitute an availability cycle. There is a random time to failure and a random time to repair. During the repair, the system is down or unavailable. In an alternating renewal process, there are two states: up (system running) and down (system undergoing repair). So we have an alternating sequence of in-service and downtimes. As shown in Figure 12.28, we can represent the two sequences of independent random variables as T, the time to failure, and R, the time to repair. As an example of an alternating renewal process, consider a lightbulb that is replaced upon failure. There are two states: (1) on, and (2) off, undergoing repair. Although probably unrealistic, we may model the rate of failure assuming exponential distributions for the rate of failure l = 1/MTTF and the rate of repair m = 1/MTTR, where MTTR is the mean time to repair. It is convenient to model the availability using Markov models. A full

T1 FIGURE 12.28 Alternating renewal process.

R1

T2

R2

T3

454

Applied Reliability

treatment of Markov models is beyond the scope of this text, but further information can be found in Ross (2003) and Shooman (1990). We show a Markov transition diagram for two states in Figure 12.29. Under the assumption that the transition from one state to the next state depends only on current state, that is, it does not matter how we arrive at a given state, it can be shown that the instantaneous availability at time t is

A(t) =

− MTTF MTTR + e MTTF + MTTR MTTF + MTTR

MTTF × MTTR MTTF + MTTR

For nearly any practical consideration, MTTF >> MTTR, and the second term in the above equation quickly becomes negligible in a short time t. So, the limiting, inherent, or steadystate availability is

A(t) =

MTTF MTTF + MTTR

Average availability can be used to estimate the steady-state availability by measuring the fraction of uptime to the total time (uptime and downtime) over an interval of time. For example, if we have a large group of identical repairable systems and during a 6-month period, the systems on the average are operational 99% of the time and down 1% of the time, the average availability during the interval is 0.99, or what is called “two 9s.” Some machines may have failed and been repaired once, twice, or more times, and other systems may have had no failures. Individual systems may have had average availability from nearly 0% to 100%. In the limit, instantaneous and average converge to the same value, called the limiting, inherent, or steady-state availability. When engineers refer to availability, the implication is limiting availability. Different availability levels are shown in Table 12.10.

λ 1

0 μ

Up

Down

FIGURE 12.29 Markov two-state model.

TABLE 12.10 Different Availability Levels Availability Level 0.99 0.999 0.9999 0.999999

Average Yearly Downtime

Units

3.65 8.77 52.6 31.6

Days Hours Minutes Seconds

455

Repairable Systems Part I

Note that 99.99% (four 9s) availability (on the average) implies downtime of 52.6 minutes per year, of 1 minute per week, or 13 minutes every quarter. However, what actually occurs depends on the MTTF and MTTR values. For example, 0.9999 availability over a year can be achieved with single downtime of 52.6 minutes or with 16 downtimes each lasting 3.29 minutes. The availability is the same, but the business impact can be totally different. In summary, an availability value does not indicate how many machines have failed and been repaired (frequency), what the lifetime is for each system (duration), or what the repair time is for each failure (downtime). Availability is a summary statistic that can hide useful information. It is better to analyze failure and repair distributions separately. The formula for availability does provide guidance on improving availability. The repair time MTTR can be decreased by improving the ability to access, diagnose, and service the problem. The logistics involved in time to respond and availability of spare parts are also important considerations. The effect of a reduction in MTTR on availability can be dramatic. For example, if the MTTF is 10,000 hours and the MTTR is 10 hours, the availability is 0.999. Reducing the MTTR to only 1 hour will increase availability to 0.9999. Availability can also be improved by using redundancy. Since availability is a probability, availability for an active parallel system can be easily determined. For example, we assume steady-state availability of 0.999 for each system in a parallel network. Both systems must be unavailable before the network fails. Network availability is 1 − (1 − 0.999) (1 − 0.999) = 0.999999. More complicated configurations (such as k out of n functioning systems) can be solved through modeling or through simulation. For further information, discussion of availability theory is presented in Barlow and Proschan (1975) and availability engineering concepts in Elsayed (1996).

12.11  Renewal Rates The renewal intensity (renewal rate) at time t is the mean number of renewals per unit time expected at time t and equals the derivative of the renewal function, that is,

m(t) =

dM(t) dt

For a stationary process, Cox (1962) shows that m(t) is equal to the reciprocal of the mean of the distribution of the interarrival times. For a Poisson process, the renewal rate is constant and equal to λ, and the independent, exponentially distributed interarrival times have mean equal to 1/λ. Also, as shown by Barlow and Proschan (1975), the cumulative renewal function is linear in time t, that is, M(t) = λt. For example, for a Poisson renewal process with intensity λ = 0.025%/hour, the expected number of repairs in 5000 hours is simply 0.00025 × 5000 = 1.25. The mean of the interarrival times is 1/λ = 4000 hours. For a renewal process in general, one can show (see discussion in Cox 1962) that in the limit as t goes to infinity, M(t) approaches a linear function of t and hence the renewal rate m(t) becomes constant. EXERCISE 12.17 A Poisson renewal process has intensity λ = 0.05%/hour. Find the expected number of repairs in 1 year. What is the mean time between repairs? What is the expected number of repairs between the first and second years?

456

Applied Reliability

12.12  Simulation of Renewal Processes We discussed in Chapter 3 how to simulate lifetimes from any failure distribution. For example, a simulated failure time from an exponential distribution with MTBF = 1000 can be done in a spreadsheet using the formula −1000*LN(RAND()). To simulate a renewal process, such as an HPP, we generate any desired number of pseudorandom exponential variables T1 , T2 , T3 , ... , Tr and then sequentially sum the lifetimes associated with cumulative counts. Thus, the first event occurs at T1, the second at T1 + T2, the third at T1 + T2 + T3, and so on. In this manner, one can visualize the variation that exists in random data from a renewal process. One can observe the variation in the number of counts at a fixed censoring time or the variation in ages for a fixed failure count. EXAMPLE 12.14  SIMULATION OF 10 TIME-CENSORED HPPS In a spreadsheet, simulate an HPP across 10 systems, each with MTBF = 1,000 hours and a censoring time of 10,000 hours. Plot the 10 cumulative plots together in a single graph.

Solution Part of the spreadsheet is shown in Figure 12.30. The MTBF value is entered into cell B1. In cell A4, for example, we enter the formula −$B$1*LN(RAND()). Then, into cell A5, we enter the formula −$B$1*LN(RAND())+A4. Using the autofill feature, we complete columns A through J to any desired count in column K. Then, we create the plot shown as Figure 12.31 by fixing the x-axis at 10,000. EXERCISE 12.18 In a spreadsheet, simulate an HPP across 10 systems, each with MTBF = 1000 hours. Censor each system at 10 failures. Plot the 10 cumulative plots together in a single graph.

FIGURE 12.30 Partial spreadsheet for 10 HPP systems with MTBF = 1000.

457

Repairable Systems Part I

20

Cumulative repairs

18 16

System 1

14

System 2

12

System 3 System 4

10

System 5

8

System 6

6

System 7

4

System 8

2 0

System 9 0

2000

4000

Age

6000

8000

10000

System 10

FIGURE 12.31 Cumulative plots of 10 simulated HPP systems with MTBF = 1000 (censored at 10,000 hours).

12.13  Superposition of Renewal Processes Consider a multicomponent system in which several identical parts can fail, causing the system to cease operation. Again, the system is restored to operation by replacement of the failed part with a new part from the same parent population. At the system level, we observe the superposition of the individual renewal processes. In general, a pooled process is not a renewal process, that is, the times between system repairs are usually not i.i.d. However, if each renewal process is a Poisson process (i.e., the interarrival times are independent and exponentially distributed), then the superposition is also a Poisson renewal process with renewal rate equal to the sum of the component renewal rates. Figure 12.32 is an example of the repair patterns for a system with three components. We assumed identical components, with exponential renewal rate λ, but in fact this result holds in general for the sum of any number of independent Poisson processes with differc



λi , ing λ i . This subject is treated further in Cox (1962). The system renewal rate is λ s = where c is the number of socket positions. i= 0 When the interarrival times for the individual renewal processes are not exponentially distributed, the process is not a Poisson process, and consequently the superposition will not be a Poisson process (or even a renewal process). However, for sufficiently long times, the superposition process converges to the stationary Poisson process under certain conditions described by Cox (1962). EXERCISE 12.19 Use the simulation procedures for the exponential distribution (described in Chapter 3) to generate 10 consecutive exponential random observations with λ = 1%/khrs for each socket position of a three-socket system. Illustrate the superposition pattern with a sketch. Plot the cumulative number of system repairs versus age. Does the plot appear linear? Based on the methods of Chapter 6, check the fit of the system interarrival times to the exponential model. Estimate the system MTTR.

458

Applied Reliability

System repair times

Component 3 Component 2 Component 1

= replacement

FIGURE 12.32 Superposition of renewal processes for system of three components.

12.14  CDF Estimation from Renewal Data (Unidentified Replacement) Let us now consider a common type of restoration involving a renewal process. Suppose we have a system (e.g., computer) consisting of c components from the same parent population. The components fail independently with a lifetime distribution F(t). We wish to obtain a nonparametric estimate of F(t), or equivalently the reliability function R(t) = 1 − F(t), from system lifetime data. Upon failure of a component, a system ceases operation. However, the system is immediately restored to operation by replacement of the failed component with another component from the same population. Moreover, we consider the special case in which only the failure ages of the system, and not the failure times of the individual components, are recorded, a condition we refer to as “unidentified replacement.” This situation may arise because records are not kept of the site of each failed component and its replacements. For example, a drilling machine tool may have many bits that are individually replaced upon failure, but only the times of machine stoppage are recorded and not the particular bit position where replacement occurred. Hence, except for a one-component system, it becomes impossible to determine whether any subsequent failures occurred on original components or on components that were replacements for original units after the first failure. Since we assume the original and replacement component lifetimes are a random sample from the component lifetime distribution F(t), the sequence of interreplacement times for a single-component position can be viewed as an ordinary renewal process. For a system, we then have a superposition of such renewal processes as shown in Figure 12.33. Given information on only the system failure times, how do we estimate the cumulative distribution function F(t) of the component lifetimes? The approach to estimating F(t) relies on the fundamental renewal equation, which relates F(t) and the system mean cumulative renewal function M(t) by means of a convolution integral. M(t) is the expected number of component renewals or replacements made through time t. First, M(t) is estimated from system failure ages, and then F(t) is estimated by numerical deconvolution of the renewal equation. This topic has been extensively treated by Trindade and Haugh (1979, 1980). They cover the multicensored situation of N systems having different operating hours. In addition, they have investigated the statistical properties of various numerical deconvolution methods. We shall describe one approach below. An alternative recursive method is presented in Appendix 12A.

459

Repairable Systems Part I

1 Unidentified component position fail times

0

2

t1

t3

0

Observed system fail times

0

t

t2

0

c

t

t4

t1

t5

t2 t3

t4

t5

t

t

FIGURE 12.33 System of c components viewed as a superposition of renewal processes.

A well-known relationship exists between the CDF F(t) for component failure times and the mean cumulative renewal function M(t). Called the fundamental renewal equation (Barlow and Proschan 1975), the relation is t

M(t) = F(t) +



∫ M(t − x)dF(x) 0

An unbiased estimator of M(t), for a single system of c components, is just ˆ (t) = n(t) M c



where n(t) is the number of renewals for all component positions by time t. Note the estiˆ ( t ) is a step function with jumps at the failure times. mator M Assume the repair times occur at ti for i = 1, 2 , 3, ... We will use the notation M(ti ) = Mi



F(ti ) = Fi t1

For t = 0, M(0) = F(0) = 0. At t = t1, we have F1 = M1 −

∫ M(t − x)dF(x). We now assume 1

0

M(t1 − x) can be approximated as having a constant value M(t1 − x) = t1

val [ 0, t1 ]. Thus, F1 = M1 − Solving for F1, we get

1 1 M1 dF( x) = M1 − M1F1. 2 2

∫ 0

F1 =

M1 1 1 + M1 2

1 M1 during the inter2

460

Applied Reliability

At t = t2, the equation can be written in expanded form as t2



F2 = M2 −



t2

t1

M(t2 − x)dF( x) = M2 −

0



M(t2 − x)dF( x) +

0

∫ M(t − x)dF(x) 2

t1

1 Assume that M(t2 − x) can be approximated as a constant value M(t2 − x) = M1 during 2 1 the interval [0, t1 ] and as M(t2 − x) = ( M1 + M2 ) during the interval [t1 , t2 ]. Thus, the equa2 tion reduces to F2 = M2 −



1 1 M1F1 + ( M1 + M2 )( F2 − F1 ) 2 2

Solving for F2, we get

F2 =



M2 + 1+

1 M2 F1 2

1 ( M1 + M2 ) 2

At t = t3, we write the equation as t3

F2 = M3 −

t1

t3

t2

∫ M(t − x)dF(x) = M − ∫ M(t − x)dF(x) + ∫ M(t − x)dF(x) + ∫ M(t − x)dF(x) 3

3

0

3

0

3

t1

3

t2

1 Again, assume that M(t3 − x) has constant value M(t3 − x) = M1 during the interval [0, t1 ], 2 1 1 M(t3 − x) = ( M1 + M2 ) during [t1 , t2 ], and M(t3 − x) = ( M2 + M3 ) during [t2 , t3 ]. Thus, the 2 2 form becomes

F3 = M3 −

1 1 1 M1F1 + ( M1 + M2 ) ( F2 − F1 ) + ( M2 + M3 ) ( F3 − F2 ) 2 2 2

Solving for F3, we have

F3 =



M3 +

1 1 M2 F1 + ( M3 − M1 ) F2 2 2 1 1 + ( M2 + M3 ) 2

Similarly,



F4 =

M4 +

1 1 1 M2 F1 + ( M3 − M1 ) F2 + ( M4 − M2 ) F3 2 2 2 1 1 + ( M3 + M 4 ) 2

We see the pattern evolving and consequently we can write the general expression

461

Repairable Systems Part I

Fi =



Mi +

1 1 1 1 M2 F1 + ( M3 − M1 ) F2 +  + ( Mi−1 − Mi− 3 ) Fi− 2 + ( Mi − Mi− 2 ) Fi−1 2 2 2 2 1 1 + ( Mi − 1 + Mi ) 2 EXAMPLE 12.15  RENEWAL DATA CALCULATION OF CDF

Consider a repairable system of 25 components. System failures were reported at 35, 79, 142, and 206 hours. Failed components were replaced with new components from the same population, with unidentified replacement. Estimate the component CDF F (t ).

Solution The renewal function estimates are ˆ =M ˆ (35) = 1/ 25 = 0.04 M 1 ˆ =M ˆ (79) = 2/ 25 = 0.08 M 2



ˆ =M ˆ (142) = 3/ 25 = 0.12 M 3 ˆ =M ˆ ( 206) = 4/ 25 = 0.16 M 4 The CDF estimates are Fˆ1 = Fˆ(35) =

ˆ (35) M = 0.039 ˆ (35) 1+ 12 M

ˆ (79) + 1 M ˆ (35)Fˆ(35) M 2 = 0.077 Fˆ2 = Fˆ(79) = 1 ˆ ˆ (79) 1+ 2 M(35) + M Fˆ3 = Fˆ(142) =

Fˆ4 = Fˆ( 206) =

ˆ (142) + 1 M ˆ (79)Fˆ(35) + M 2 1+

1 2

ˆ (142) − M ˆ (35) Fˆ(79) M

1 2

ˆ (79) + M ˆ (142) M

ˆ ( 206) + 1 M ˆ (79)F( ˆ(35) + M 2

1 2

= 0.113

ˆ (142) − M ˆ (35) Fˆ(79) + M 1+

1 2

ˆ (142) + M ˆ ( 206) M

1 2

ˆ ( 206) − M ˆ (79) Fˆ(142) M

= 0.148

EXERCISE 12.20 Consider, using the data in Example 12.12, a fifth renewal at 306 hours. Write the expression for Fˆ (t5 ). Solve for Fˆ (306). EXERCISE 12.21 Write the expression for Fˆ (t6 ). EXERCISE 12.22 The renewal function for an HPP with MTBF θ is M(t) = t/θ. Assume that a repairable system with MTBF = 1000 hours has repair times at 75, 175, 300, 475, 675, and 900 hours. Compare the exact values of the CDF exponential distribution F(t) at the repair times with the F(t) estimates obtained by the fundamental renewal equation.

462

Applied Reliability

12.15  Summary In this chapter, we introduced some important concepts on repairable systems. We emphasized the distinction between analysis of repairable and nonrepairable systems. We concentrated on renewal processes. These ideas illustrate how different approaches are useful for providing for spare parts, estimating repair schedules, determining costs, and so on. We have shown graphical and modeling methods useful for analyzing data from repairable systems. In Chapter 13, we extend repairable system concepts to nonrenewal processes.

Appendix 12A 12.1A  True Confidence Limits for the MCF We follow the approach described by Lawless and Nadeau (1995). However, we make a slight modification in the notation terminology to include both repairs and censoring times as events. For simplicity, the authors use a discrete time framework, in which an event (repair or censoring) occurs at the distinct time t, where t = 0, 1, 2, 3, … Each time corresponds to the age at which an event occurs. We need to define terms used in the formulas for the MCF M(t) and the variance V (t). Let ni (t) represent the number of events that occur at time t for system i so that m(t) = E[ni (t)] and M(t) =



t

s= o

m( s). Here ni (t) = 1 for a repair at t and ni (t) = 0 for a censored observa-

tion at t. System i is observed over the interval [0, τ i ], that is, τ i is the censoring time for system i. A system at risk is indicated by δ i (t) = 1 for t ≤ τ i (system at risk) and δ i (t) = 0 if t > τ i (censored observation). The total number of events observed by time t is denoted by n.(t) =



k

i=1

δ i (t)ni (t), and the total number of systems at risk at time t is δ.(t) =



k

i=1

δ.(t).

The k systems are assumed mutually independent. If t1 < t2 <  < tr are the distinct times for the occurrence of events across all systems, then the MCF estimate is



ˆ (t) = M

t

∑ s= 0

t

ˆ ( s) = m

∑ nδ.(.(ss)) . s= 0

The variance estimate is



Vˆ (t) =

k

t

i=1

s= 0

∑∑

δ.( s) ˆ ( s)] [ni ( s) − m δ.( s)

2

We now show how these formulas can be evaluated in a spreadsheet. The data are the times of repairs and censoring for the five systems in Table 12.1 and displayed again in Table 12.1A.

463

Repairable Systems Part I

TABLE 12.1A Repair Histories for Five Systems Age at Repairs System i

ti1

ti2

ti3

ti4

ti5

Censor Age τi

1 2 3 4 5

222 273 125 63 91

584 766 323 195 427

985 1054 – 325 761

1161 – – – 1096



1901 1316 442 636 2214

– – – 1796

FIGURE 12.1A Spreadsheet setup for variance estimates.

We will show the spreadsheet procedure for a subset of the data. The full analysis is left as an exercise. We begin by setting up the spreadsheet for estimating the MCF according to the notation requirements as illustrated in Figure 12.1A. Note that the cells for systems 2 and 3 are shaded in grey in rows 6 and 7, respectively, to show censoring ages. After the censoring ages, those systems are no longer at risk. In Figure 12.2A, we show the steps involved in the variance estimate. Rows 25 through 29 evaluate the formula δ i (t) δ.(t) [ ni (t) − m(t)] for each system i at each value of t. Rows 31 through 35 perform a consecutive sum at each t value for each system. Rows 37 through 41 square each consecutive sum. Finally, the variance estimate in row 43 is obtained by summing rows 37 through 41 vertically across all systems in at each t value. The MCF standard error (σˆ M (t)) can be obtained by taking the square root of the variance. Then, the upper and lower 95% confidence limits can be determined by the normal approximation ˆ (t) ± 1.96σˆ (t) as shown in Figure 12.3A. A better estimate that avoids negative formula M M limits is obtained by applying the previously described (Section 7.1) logit ­transformation

464

FIGURE 12.2A Calculations for variance estimates.

Applied Reliability

465

Repairable Systems Part I

t 0

63

91

125

195

222

273

323

325

427

V(t) 0 0.032 0.048 0.048 0.112 0.080 0.032 0.048 0.128 0.112

442

584

636

761

0.112 0.084 0.084 0.191

Stdev 0 0.179 0.219 0.219 0.335 0.283 0.179 0.219 0.358 0.335 0.335 0.290 0.290 0.437 M(t) 0 0.200 0.400 0.600 0.800 1.000 1.200 1.400 1.600 1.800 1.800 2.050 2.050 2.383 Normal Approx 0 –0.151 −0.029 0.171 0.144 0.446 0.849 0.971 0.899 1.144 LCL

1.144

1.482 1.482 1.526

Normal Approx 0 0.551 0.829 1.029 1.456 1.554 1.551 1.829 2.301 2.456 2.456 2.618 2.618 3.241 UCL w

5.7723 2.9257 2.0456 2.2703 1.7408 1.3393 1.359

1.55 1.4397 1.4397 1.319 1.319 1.4329

Logit Transformation LCL

0.0346 0.1367 0.2933 0.3524 0.5744 0.896 1.0302 1.0322 1.2503 1.2503 1.5542 1.5542 1.6633

Logit Transformation UCL

1.1545 1.1703 1.2274 1.8163 1.7408 1.6072 1.9026 2.48 2.5914 2.5914 2.704 2.704 3.415

FIGURE 12.3A MCF variance, standard error, and confidence limits.

ˆ (t)), LCL = M ˆ (t)/σˆ (t), and (Meeker and Escobar 1998), where w(t) = exp(1.96 * σˆ M (t)/M M ˆ ˆ UCL = M(t)σ M (t). EXERCISE 12.1A Using the spreadsheet approach, estimate the MCF, variance, and 95% confidence limits (both normal approximation and logit transformation limits) for the data in Table 12.1A for all event times.

12.2A  Cox F-Test for Comparing Two Exponential Means As described by Cox and Lewis (1966) for an exponential distribution with parameter θ, if the random variable T is the time of occurrence of the rth event, then 2T/θ = 2 r θˆ/θ is ­distributed as a chi-squared variable with 2r degrees of freedom. The distribution of the ratio of two chi-squared variables is the variance ratio or F distribution. For two independent exponential distributions, the ratio θˆ 2 θ1 θˆ 1θ2 has an F distribution with (2 r 2 , 2 r 1 ) degrees of freedom. Thus, we can test the hypothesis of equal exponential means, that is, θ1 = θ2 , by constructing a 100(1 − α)% confidence interval for the ratio θ1 θ2 :



θˆ 1 θ1 θˆ 1 F < < Fα/2, 2 r2, 2 r1 1 − α / 2 , 2 r , 2 r 2 1 θ2 θˆ 2 θˆ 2

If the confidence interval excludes the value 1, the two exponential means are said to be different at the 100α% level of significance. These limits are approximate for time-­censored data.

466

Applied Reliability

EXAMPLE 12.1A  THE COX F-TEST Using the data from Table 12.6, perform the Cox F-test on the equality of the means.

Solution The data are time censored. The MTTFs and failure counts are



θˆ 1 = 19374.2

r1 = 5

θˆ 2 = 6550.2

r2 = 14

The ratio is θˆ 1 = 2.958 θˆ 2



For an approximate single-sided test, we can compare this ratio to the α = 0.05 percentile of the F distribution with 28 and 10 degrees of freedom. The spreadsheet function that provides the percentiles of the F distribution is FINV(0.05,28,10), which gives us a value of 2.710. Since the ratio θˆ 1/θˆ 2 >2.710, we conclude that there is a statistically significant difference at the 95% confidence level. A 95% two-sided confidence interval, using the spreadsheet functions FINV(0.975,28,10) = 0.393 and FINV(0.025,28,10) = 3.327, is then ( 2.958)(0.393) <



θ1 < ( 2.958)(3.327) θ2

or



1.161 <

θ1 < 9.840 θ2

Note that the interval excludes the value 1, which indicates rejection of the hypothesis of equality of the means. EXERCISE 12.2A Apply the Cox F-test to the data in Exercise 12.14. EXERCISE 12.3A Apply the Cox F-test to the data in Exercise 12.15.

12.3A Alternative Approach for Estimating CDF Using the Fundamental Renewal Equation We begin by writing the fundamental renewal equation in the equivalent form t



F(t) = M(t) +

∫ M(x)dF(t − x) 0

467

Repairable Systems Part I

By numerically solving the this equation, Trindade and Haugh (1979, 1980) show that the following equations can be used to estimate the CDF at a given time ti:

ˆ (t ) Fˆ (t1 ) = M 1



ˆ (t ) − Fˆ (t − t ) M ˆ (t ) Fˆ (t2 ) = M 2 2 1 1

We note at the second failure time t2 that Fˆ (t2 − t1 ) is required and t2 − t1 may be greater than, equal to, or less than t1. By using the information that Fˆ (t1 ) is specified at t1, we can develop a recursive approach such that Fˆ (t2 − t1 ) = 0,

if t2 − t1 < t1

= Fˆ (t1 ) if t2 − t1 ≥ t1

In words, if the time t2 − t1 between the first and second renewal is less than the time t1 to the first renewal, we set the CDF estimate for the time t2 − t1 equal to zero; if the between time is greater than t1, the CDF estimate is set equal to Fˆ (t1 ). See Figure 12.4A for a ­graphical representation of this procedure. Similarly, the component CDF based on the third renewal is given by ˆ (t ) ˆ (t ) − Fˆ (t − t )M ˆ (t ) − Fˆ (t − t ) M ˆ (t ) − M Fˆ (t3 ) = M 3 3 1 1 3 2 2 1



where the CDF for times between renewals is estimated by Fˆ (t3 − t j ) = 0,

if t3 − t j < t1

= Fˆ (t1 ) if t1 ≤ t3 − t j < t2



= Fˆ (t2 ) if t2 ≤ t3 − t j for j = 1,2. Thus, we see that differences, t3 − t j , between the given renewal time t3 and each earlier time t j, for j = 1,2, are compared to the individual renewal times in order to assign a proper Fˆ (t3 − t j ) value from the previously calculated Fˆ (t) estimates. ∧ F (t) ∧ F (t2)

∧ F (t)

∧ F (t3 − t2) = 0 ∧ ∧ F (t3 − t1) = F (t1)

∧ F (t2)

∧ F (t1)

∧ F (t)

∧ ∧ F (t3 − t2) = F (t1) ∧ ∧ F (t3 − t1) = F (t2)

∧ F (t2)

∧ F (t1)

0

t1 (t3 − t2)

t2 t3

(t3 − t1)

t

∧ ∧ F (t3 − t2) = F (t2) ∧ ∧ F (t3 − t1) = F (t2)

∧ F (t1)

t1 (t3 − t2)

t2 (t3 − t1)

FIGURE 12.4A Possible outcomes for time differences in renewal estimation.

t3

t

t1

t2 (t3 − t2)

t3 (t3 − t1)

t

468

Applied Reliability

We will give the general equation for the estimate of the component CDF Fˆ (tk ) at each system renewal time tk , but as we showed above, the formulas can be written without using such detailed notation shown below. So for k ≥ 2, we have



ˆ (t ) − Fˆ (tk ) = M k

k −1

∑ Fˆ (t

k

ˆ (t ) − M ˆ (t ) − tj ) M j j −1

j=1

ˆ (0) = 0. We develop recursively the CDF for times for 1 ≤ j ≤ k − 1, with t0 = 0, and Fˆ (0) = M between renewals as Fˆ (tk − t j ) = 0,

if tk − t j < t1 ,

= Fˆ (ti )



if ti ≤ tk − t j < ti+1 ,

= Fˆ (tk −1 ) if tk −1 ≤ tk − t j The comparisons of time differences to observed failure times to locate the proper ­recursive values for evaluating the CDF at time t can involve considerable computation and ­bookkeeping effort as the number of failures increases. Similar to the method presented in Section 12.9, Nelson (2003) describes an approach that avoids the recursive comparisons associated with this method. Trindade (1980) also presents several procedures to reduce accounting and speed up computations. These methods divide the total time of study (T) into k equally spaced intervals to numerically deconvolve the fundamental renewal equation. The choice of k is arbitrary and can vary for different domains of the total time T. For a single domain, the right-hand endpoints of these intervals will be T/k, T/k, … , nT/k, … , T, or in general, nT/k for n = 1, 2, … , k. The value of M(t) during the integral can be approximated at the beginning, midpoint, or interval end, corresponding to a minimum, linearly interpolated, or maximum value for M(t). Nelson’s (2003) approach does not require equally spaced intervals. EXAMPLE 12.2A  RENEWAL DATA CALCULATION OF CDF Consider a system of 25 components. System failures were reported at 35, 79, 142, and 206 hours. Estimate the component CDF F (t ).

Solution The renewal function estimates are ˆ (35) = 1/ 25 = 0.04 M



ˆ (79) = 2/ 25 = 0.08 M ˆ (142) = 3/ 25 = 0.12 M ˆ ( 206) = 4/ 25 = 0.16 M

Repairable Systems Part I

469

The CDF estimates for the recursive approach are ˆ (35) = 0.04 Fˆ(35) = M



ˆ (79) − Fˆ(44)M ˆ (35) = 0.078 Fˆ(79) = M ˆ (79) − M ˆ (35)] = 0.115 ˆ (142) − Fˆ (107) Mˆ (35) − Fˆ(63)[M Fˆ(142) = M ˆ (35) − F (127)[M ˆ (79) − M ˆ (35)] − Fˆ(64)[M ˆ (142) − M ˆ (79)] = 0.151 ˆ ( 206) − Fˆ (171) M Fˆ( 206) = M EXERCISE 12.4A Consider, using the data in Example A12.1, a fifth renewal at 306 hours. Write the expression for Fˆ (t5 ). Solve for Fˆ (306).

Problems 12.1 Simulate 10 pseudorandom times from a lognormal distribution having mean 100 and sigma 1.5, using the methods in Chapter 5, and keeping the order of ­generation intact. Treat the sequential times as consecutive interarrival times of a system undergoing repair. Determine the system ages at repair. Make a line sketch of the repair pattern. Make a cumulative plot. Make a plot of the interarrival times versus system age. Make a lognormal probability plot of the interarrival times. Estimate the mean and sigma of the data. Plot the fitted model against the empirical CDF. 12.2 Repeat Problem 12.1 using 10 simulated values from a Weibull distribution, using the procedures in Chapter 4, with characteristic life of 50 hours and shape parameter 1.5. 12.3 The following repair times (in hours) are recorded on four similar systems: System 1: 197, 241, 368, 874, 927 System 2: 56, 119, 173, 506, 771 System 3: 99, 316, 414, 663 System 4: 156, 222, 375 The system censoring times are 1000, 800, 700, and 400 hours, respectively. Estimate the mean cumulative repair function and include in a cumulative plot with the data from the four systems. 12.4 Components are assumed to fail according to an exponential distribution with failure rate of 0.01%/hour. The mission duration is 7500 hours. How many spare parts must be carried to assure at least 99% probability of mission success? State any assumptions you make. What is the expected waiting time to the first, ­second, and third repairs? 12.5 Treat the four system repair times in Problem 12.3 above as the individual renewal times of four components in a four-component system during the first

470

Applied Reliability

400 hours of operation. Exclude times greater than 400 hours. Estimate the renewal ­function at each repair point. Estimate the system renewal rate. Make a sketch of the superposition process. 12.6 Using the system renewal data from Problem 12.5, estimate the CDF of the ­components from the cumulative renewal function.

13 Repairable Systems Part II: Nonrenewal Processes In Chapter 13, we presented analysis techniques for repairable systems, focusing primarily on renewal processes. The times between failures were independent and identically distributed (i.i.d.) observations from a single population. If the i.i.d. assumptions do not hold, the renewal process is not a suitable model. Consequently, we need other analysis methods that deal with more general patterns of sequential repair times. For example, if a trend is present, such information would be useful for both the user and the manufacturer in determining if reliability objectives are being met. One may also ask what design or operation factors influence repair frequency, whether maintenance schedules are appropriate, what warranty costs are anticipated, whether the provisions for spare parts are adequate, and so on. In many real-life instances, the time to a subsequent repair is generally a function of many variables including the basic system design, the operating conditions, the environment, and the quality of the repairs (the materials used, the competency of the technician, etc.). Thus, there is a genuine possibility of nonrenewal processes in which interrepair times are not independent or identically distributed. For multicomponent systems and different types of repair actions, the renewal model becomes even less plausible. This chapter continues the development of key concepts for analyzing data from systems subject to repair. We present both graphical and analytical procedures. We consider specific models and goodness-of-fit tests. Methods for detecting trend are presented. Simulation methods are covered. The topics of reliability growth and associated models are also discussed. This chapter deals with the topic of nonrenewal repairable reliability, an area that has many important and practical applications.

13.1  Graphical Analysis of Nonrenewal Processes As in Chapter 12, we consider a single system for which the times to make repairs are ignored. Suppose that consecutive times to repair are observed at the following system ages: 20, 41, 67, 110, 159, 214, 281, 387, 503, and 660. A line sketch shows the pattern of repairs (Figure 13.1). For this set of data, the cumulative plot is Figure 13.2. Figure 13.2 shows a flattening derivative of the curve as time increases, indicating a decreasing frequency of repairs (i.e., an improving repair rate). It seems obvious that there is a trend of longer times between repairs. Yet, when we look at the times between consecutive failures, we get 20, 21, 26, 43, 49, 55, 67, 106, 116, and 157.

471

472

Applied Reliability

0

100

200

300 400 500 System age (hours)

600

700

FIGURE 13.1 Dot plot of repair pattern.

11

Cumulative number of repairs

10 9 8 7 6 5 4 3 2 1 0

0

100

200

300 400 System age (hours)

500

600

700

FIGURE 13.2 Cumulative plot (improving trend).

These are exactly the same interarrival times—in a different order—that we observed in the renewal case in Chapter 12 (see Figure 12.2). Nevertheless, it would be inappropriate to analyze the data in the same way as we did previously for the renewal case because now the interarrival times are not independent random observations from a single distribution. There is a trend. We cannot neglect the chronological order of the data. The plot of the interarrival times versus the system age is shown in Figure 13.3. The graph makes the trend obvious. On the other hand, suppose the repairs occur at the following times: 157, 273, 379, 446, 501, 550, 593, 619, 640, and 660, with the associated line sketch shown in Figure 13.4. Then, the consecutive times between repairs are 157, 116, 106, 67, 55, 49, 43, 26, 21, and 20. Again these are exactly the same interrepair times—in a different order—as in the two previous cases. However, the order now shows an increasing frequency, indicating a degradation in system reliability. The cumulative plot is Figure 13.5. The plot is curving monotonically upward in time, toward a higher frequency of repairs. A trend is present. Again, it is not correct to analyze the time between repairs as if they were independent observations from a single distribution as we did for the renewal process. Chronological order of the data is important. Analyses specific to repairable systems must be used. The plot of interarrival times versus system age is also very revealing of the degradation trend. We show the chart in Figure 13.6.

473

Repairable Systems Part II: Nonrenewal Processes

180 160 140

Xi (hours)

120 100 80 60 40 20 0

0

100

200

300 400 System age (hours)

500

600

700

FIGURE 13.3 Interarrival times versus system age (improving trend).

0

100

200

300 400 500 System age (hours)

600

700

FIGURE 13.4 Dot plot of repair pattern.

11 Cumulative number of repairs

10 9 8 7 6 5 4 3 2 1 0

0

100

FIGURE 13.5 Cumulative plot (degrading trend).

200

300 400 System age (hours)

500

600

700

474

Applied Reliability

180 160 140

Xi (hours)

120 100 80 60 40 20 0

0

100

200

300 400 System age (hours)

500

600

700

FIGURE 13.6 Interarrival times versus system age (degrading trend).

13.2  Two Models for a Nonrenewal Process To illustrate one possible approach to analyzing data from a nonrenewal process, let us attempt to model the cumulative repair function for the degradation data of Figure 13.4 using a simple equation with curvature. Two possibilities are the power relation

M(t) = at b

and the exponential relation

M(t) = ae bt

where a and b are empirically estimated constants. These equations might adequately model monotonically increasing or decreasing functions. Obviously, many other models are possible. Also, be aware that we are fitting a smooth curve, M(t), to a step function ˆ (t), similar to the relationship between an empirical distribution function and estimate, M its theoretical CDF. The first step is to rectify the equations (linear rectification is described in Chapter 6). For the power relation, we take logarithms of each side to get

ln M(t) = ln a + b ln t

ˆ (t) versus ln t on linearfrom which we see that, if the model is adequate, the plot of ln M ˆ by-linear paper (or M(t) vs. t on log–log paper) should approximate a straight line with intercept ln a and slope b. The data are plotted in Figure 13.7. The plot appears to show significant curvature, indicating a potentially poor fit of the power relation model to the data.

475

Repairable Systems Part II: Nonrenewal Processes

M(t)

10

1

0.1 100

System age t (hours)

1000

FIGURE 13.7 Power law model rectification.

Let us consider an alternative formula: the exponential relationship. Taking logarithms of both sides yields

ln M(t) = ln a + bt

ˆ (t) versus t on linear paper (or M ˆ (t) versus t on If the model is adequate, the plot of ln M semilog paper) should approximate a straight line with intercept ln a and slope b. The data are plotted in Figure 13.8. The plot is a nearly straight line, indicating a reasonable fit. A least squares regression (a practical but only approximate method since usual least squares assumptions are not fulfilled) on the data in Figure 13.8 produced the following estimates for the empirical parameters: a = 0.556, b = 0.00434. The actual cumulative plot with the fitted model ­equation ˆ (t) = 0.556e 0.00434t is shown in Figure 13.9. The model appears to provide an acceptable fit. M What does the fitted model represent? The model is an estimate of the (population) mean number of repairs per system at a specified time. Obviously, the system is getting worse. Although future values of the repair function may be extrapolated forward, we may be more interested in the question, “What is the cause of the deterioration?” The model for a single system does not tell us enough about the population distribution ˆ (t) of the of the cumulative number of repairs at that time. How accurate is this estimate M expected (population) number of repairs by time t? What is the discrete distribution of the number of repairs at time t? Additional analytical tools are needed to address these questions. We shall discuss these matters further in the remaining sections of this chapter. In summary, these examples illustrate that, when presented with system repair data, one must be careful about making an assumption of a renewal process without justification. Ascher (1981) provides a similar approach to the Weibull analysis of interarrival data. Checking for trends, independence, and distribution properties in the data is important for proper analysis.

476

Applied Reliability

M(t)

10

1

0.1

0

100

200

300 400 System age t (hours)

500

600

700

FIGURE 13.8 Exponential model rectification. 12 10 Model

M(t)

8 6

Actual

4 2 0

0

100

200

300 400 System age t (hours)

500

600

700

FIGURE 13.9 Exponential model fit.

EXERCISE 13.1 Perform an analysis of the improving system data in Figure 13.2. Attempt to fit power relation and exponential models to the data. Make the rectification plots. Decide on the better model and estimate the empirical parameters. Plot the model fit with the actual data in a cumulative plot. Comment on the results. Can you suggest an alternative model with a better fit?

Repairable Systems Part II: Nonrenewal Processes

477

13.3  Testing for Trends and Randomness If we are developing a model, we should verify the assumptions that justify using that model. In Chapter 12, we discussed renewal processes that involve a sequence of i.i.d. interarrival times. A renewal process is also a stationary process, that is, the number of events in an interval depends only on the length of the interval and not on the starting point. If any trend is present, the process is nonstationary and not a renewal process. Conversely, we will assume that a process that has no trend is stationary. (Exceptions are possible but not common in practice.) Thus, the first step in the analysis of repairable system data is to check for a trend, since a trend rules out a common distribution for all the interarrival times. How might trends or nonstationarity develop? For repairable systems, the environment, usage, part supplies, maintenance, and so on may be changing. For example, if a failed part is replaced with a new part from a time or vintage of manufacture different than that of the original, then the population of replacements may have a different time to failure distribution. Consequently, we could have a nonstationary process with independent but not identically distributed arrival times. In other situations, old systems may not receive as much attention as new systems, thereby affecting the repair frequency. A system may have a subsystem, such as a cooling fan, that degrades and causes deterioration in the interarrival times. Individuals performing repairs or maintenance may learn how to do the job better and thus improve the subsequent repair frequency. The environment may be changing. Thus, there are many possible causes of trends. Another example of nonstationarity mentioned in Chapter 12 comes from Davis (1952). He analyzed the number of miles between bus engine repairs by comparing histograms of the kth interarrival times among systems, for example, a histogram of the time to first repair, the times between first and second repairs, and so on. He saw that the times to first failure were fairly normally distributed but later interarrival times tended to be more skewed, appearing nearly exponential. For single or multiple copies of a system, how do we statistically assess if interarrival times have any trend, that is, degradation or improvement? We have already mentioned that the first action should be to plot the data. A cumulative plot of number of repairs or the estimated mean cumulative repair function M(t) versus system age is often revealing of trends. Plots of interarrival times versus age are also very useful. Graphs are informative, especially for checking model adequacy, but development of analytical models allows for formalized decisions, data summary, predictions, and so on. What are the steps one should follow in the statistical analysis of repairable system data? For chronologically ordered times between repairs, one first investigates if a trend is present. Several alternative considerations are possible in checking for a trend: Are we doing a general test for any trend or a test against a specific type of trend? We will present several options in this section. If a trend is indicated, a nonstationary model is  used. We will present several nonstationary models in Section 13.4, the most ­i mportant being the nonhomogeneous Poisson process (NHPP). Under stationarity, the interarrival times are identically distributed (see Cox and Lewis 1966), but not necessarily independent. 
In this case of no trend, one then examines whether observations are independent. To check for independence, concepts from time series analysis have been suggested in the literature. For example, scatter plots of interarrival times versus the immediately

478

Applied Reliability

preceding interarrival time (lag 1) or earlier lags are used to check for serial correlation in the data. However, because of the typically skewed nature of reliability data, such plots are often not very revealing. In addition, large amounts of data are generally required for time series approaches. For further information on this topic, see the discussion in Cox and Lewis (1966) or the example in Crowder et al. (1991). Ascher and Feingold (1984) point out that even if there is dependence, it is still meaningful to reorder the interarrival times by magnitude and estimate the properties of their common distribution. For a stationary process, if the interarrival times are i.i.d., then a renewal process exists. If a renewal process, one should consider the form of the distribution of the times. In particular, is the Poisson process (homogeneous Poisson process [HPP]) appropriate? If an HPP is rejected, other distributions are possible model candidates or one may resort to distribution-free techniques. Ascher and Feingold (1984) provide a more complete overview of statistical analysis procedures for repairable systems failure data. We mention that simple transformations may change the nature of the data. Also, using a different scale other than clock or calendar time (e.g., number of units produced, length of material consumed, etc.) may affect the characteristics of the process. 13.3.1  Other Graphical Tools We have already discussed in Section 13.1 the cumulative plot and the plot of interarrival times against system age as valuable graphical tools for the analysis of failure data from repairable systems. Another approach, applicable to a single system (also extendable to a pooled estimate M(t)), is to analyze the reciprocals of successive interarrival times to determine if a trend is present. This charting procedure is based on the rate of repairs m(t), that is, the derivative of the expected cumulative number of repairs in time. The rate of repairs is also known as the rate of occurrence of failure (ROCOF). Recall that

m(t) =

dM(t) dt

An interarrival time measures a period between repairs, and taking the reciprocal is equivalent to estimating an average rate between consecutive repairs. Plotting these average repair rates (ARR) versus total elapsed times can quickly reveal periods of increasing, decreasing, highly varying, or nearly constant repair frequencies. We refer to such charts as the ARR plots. Figures 13.10 through 13.12 show the ARR for the three sets of repair data (constant, improving, and degrading) previously presented. The ARR charts visually confirm steady, decreasing, and increasing repair rates. When sufficient data are available, convenient, equal-length time intervals may be selected, and the number of failures during each interval divided by the interval length to obtain an average ROCOF plot. For smoothed repair rate plots, numerical differentiation techniques may be applied directly to cumulative plot curves as described by Trindade and Haugh (1979). An alternative smoothing procedure is described by Barlow, Proschan, and Scheuer (1969). EXERCISE 13.2 Using the data from Exercise 13.1, make an ARR plot. Is there evidence of any trend?

479

Repairable Systems Part II: Nonrenewal Processes

0.06

1/Xi (repairs/hour)

0.05 0.04 0.03 0.02 0.01 0

0

100

200

300 400 System age (hours)

500

600

700

300 400 System age (hours)

500

600

700

300 400 500 System age (hours)

600

700

FIGURE 13.10 Average repair rates versus time (renewal data).

0.06

1/Xi (repairs/hour)

0.05 0.04 0.03 0.02 0.01 0

0

100

200

FIGURE 13.11 Average repair rates versus time (improving).

0.06

1/Xi (repairs/hour)

0.05 0.04 0.03 0.02 0.01 0

0

100

200

FIGURE 13.12 Average repair rates versus time (degrading).

480

Applied Reliability

13.4  Laplace Test for Trend It is apparent that there are two different approaches in the analysis of trends for a series of events. One may be interested only in whether a trend of any type is present. Nonparametric procedures, as discussed in Sections 13.4 through 13.6, may be useful here. Alternatively, one may check the trend against a specific model for the data. (For a full discussion, see Cox and Lewis 1966.) In the latter case, the chosen model will have an important effect in establishing the significance of any trend. For example, a simple (parametric) way to check for trend involves testing whether or not an observed series of events is of Poisson type. The procedure tests whether observations come from a sequence of i.i.d. exponential random variables. Rejection implies that the HPP is not an appropriate model for the data, that is, we accept the alternative that some type of trend is present. Let us see how such a test may be constructed. Suppose we observe n repairs for a single system during the interval 0 to T1 (i.e., we have time censoring). Ross (2003) and Parzen (1962) show that under the HPP, the first n system ages t11 , t12 , … , t1n at which repairs occur have the same distribution as if they were the order statistics corresponding to n independent random variables U 1 , U 2 , … , U n uniformly distributed on the same interval 0 to T1. We use the term order statistics in the sense that t11 is the smallest value among U 1 , U 2 , … , U n, and t12 is the second smallest value, and so on. Hence, one method of testing for an HPP is to test whether the observations U 1 , U 2 , … , U n are independent and uniformly distributed over the interval 0 to T1. According to the central limit theorem (see Chapter 5), for moderately large values of n, the sum n



Sn =

∑U

i

i=1

of n independent random variables, each uniformly distributed on the interval 0 to T1 , may be considered normally distributed with mean nT1/2 and variance nT12/12. Thus, if we see n = 10 repairs in T1  = 576 hours of observation, then the sum S10 of the times when the events occur is (approximately) normally distributed with mean 10(576)/2 = 2880 and standard deviation 525.8. Consequently, at a 5% level of significance, if the sum S10 satisfies the inequality

2880 − (1.96) 525.8 ≤ S10 ≤ 2880 + (1.96) 525.8

or

1849.4 ʺ S10 ʺ 3910.6

the HPP model is not rejected. This same procedure can be expressed in terms of the test statistic n

∑t

ij



Li =

j=1



nTi 2

Ti n 12

481

Repairable Systems Part II: Nonrenewal Processes

which approaches a standard normal variate under the HPP for moderately large n. In this form, the test is called the Laplace or centroid test. Bates (1955) has shown the approximation adequate for the 5% level of significance for n at least equal to 4. If we have repair series data from k independent systems, each system observed for time Ti, i = 1, 2, ..., k (i.e., time censoring), with repair counts of n1, n2, …, nk and repair times for the ith system given by tij, then we can test each series individually or we can pool the data into a single test statistic:

$$L = \frac{\sum_{j=1}^{n_1} t_{1j} + \sum_{j=1}^{n_2} t_{2j} + \cdots + \sum_{j=1}^{n_k} t_{kj} - \frac{1}{2}\left(n_1 T_1 + n_2 T_2 + \cdots + n_k T_k\right)}{\sqrt{\frac{1}{12}\left(n_1 T_1^2 + n_2 T_2^2 + \cdots + n_k T_k^2\right)}}$$

where L has very nearly a standard normal distribution. For failure-censored data, the above test statistics are modified by replacing each ni by ni − 1 (including summations) and setting each system period of observation Ti equal to the time of the nith failure. Note that all systems may have different repair rates; we are testing the null hypothesis of an HPP and not the equality of rates. EXAMPLE 13.1  LAPLACE TEST FOR TREND VERSUS A POISSON PROCESS Three repairable systems have experienced the following consecutive interarrival times:
System 1: 98, 150, 37, 62, 15
System 2: 15, 77, 96, 14, 12
System 3: 55, 80, 48, 37
The periods of observation were 375, 220, and 225 hours, respectively. Is there any evidence against the null hypothesis of an HPP for each system? Is there any evidence based on the overall test?

Solution For system 1, the repair ages are 98, 248, 285, 347, and 362 hours. The sum of the system ages is 1340 hours. Hence, the test statistic is



$$L_1 = \frac{1340 - \frac{5(375)}{2}}{375\sqrt{5/12}} = 1.66$$

Reference to a standard normal table shows a two-sided level of significance equal to 2 × (4.8%) = 9.6%. Similarly, the test statistics for the remaining two systems are L2 = 1.13 and L3 = 1.10, with respective two-sided significance levels of 26% and 27%. Based on these results, at the 10% significance level, there is evidence against the HPP for system 1, but not for systems 2 and 3. The overall test statistic is



$$L = \frac{1340 + 711 + 593 - \frac{1}{2}\left[5(375) + 5(220) + 4(225)\right]}{\sqrt{\frac{1}{12}\left[5(375)^2 + 5(220)^2 + 4(225)^2\right]}} = 2.28$$

with a two-sided significance level of 2.2%. Collectively, there is fairly strong evidence against the HPP as an appropriate model for the system repair data.
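The arithmetic of the single-system and pooled Laplace statistics is easy to script. The following is a minimal Python sketch (Python is not one of the packages used in this book, and the helper names laplace_single and laplace_pooled are our own) that reproduces the Example 13.1 calculations under time censoring.

```python
# A minimal sketch of the Laplace (centroid) test of Section 13.4.
# `ages` are system ages at repair; `T` is the observation period.
from math import sqrt, erf

def norm_sf(z):                       # upper-tail standard normal probability
    return 0.5 * (1.0 - erf(z / sqrt(2.0)))

def laplace_single(ages, T):
    n = len(ages)
    return (sum(ages) - n * T / 2.0) / (T * sqrt(n / 12.0))

def laplace_pooled(systems):          # systems: list of (ages, T) pairs
    num = sum(sum(a) - len(a) * T / 2.0 for a, T in systems)
    den = sqrt(sum(len(a) * T**2 for a, T in systems) / 12.0)
    return num / den

sys1 = ([98, 248, 285, 347, 362], 375)
sys2 = ([15, 92, 188, 202, 214], 220)
sys3 = ([55, 135, 183, 220], 225)

for k, (a, T) in enumerate([sys1, sys2, sys3], start=1):
    L = laplace_single(a, T)
    print(f"L{k} = {L:.2f}, two-sided p = {2 * norm_sf(abs(L)):.3f}")

L = laplace_pooled([sys1, sys2, sys3])
print(f"pooled L = {L:.2f}, two-sided p = {2 * norm_sf(abs(L)):.3f}")  # about 2.28 and 0.022
```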


EXERCISE 13.3 Perform the Laplace test individually on the three sets of data of Figures 12.2, 13.2, and 13.5, respectively. EXERCISE 13.4 Perform the pooled Laplace test on the data from Example 13.1, assuming failure censoring at the fourth, fifth, and third repairs, respectively, for the three systems.

13.5  Reverse Arrangement Test In testing against the HPP, which is a special type of renewal process, we did not first verify the assumption that the interarrival times were i.i.d. However, the test statistic we used is based on the i.i.d. premise. An easy-to-understand, more general test that distinguishes between i.i.d. interarrival times (i.e., a renewal process) and a monotonic trend will now be described. The test has been applied by De Le Mare (1991). The procedure is a nonparametric reverse arrangement test (RAT) originally devised by Kendall (1938) and further developed into a table by Mann (1945). Consider a set of n interarrival times occurring in the sequence

$$X_1, X_2, \ldots, X_n$$

Starting from left to right, a reversal is defined as each instance when a time value X i in the sequence is less than any of the subsequent time values X i+1, X i+ 2 , ..., X n, that is, a reversal occurs each time the following inequality holds:

X i < X j for i < j, where i = 1, … , n − 1 and j = i + 1, … , n

For example, say a system has repairs at ages 25, 175, 250, and 350 hours. In the sequence of interarrival times 25, 150, 75, and 100, there are a total of 3 + 0 + 1 = 4 reversals since 25 is less than the three values of 150, 75, and 100, 150 is not less than any other number, and 75 is less than 100. For n items, what is the probability of a given number of reversals occurring by chance alone? A larger than expected number of reversals, associated with growing X i, would indicate an increasing lifetime trend (less frequent repairs). Conversely, a smaller than expected number is consistent with diminishing X i and a decreasing trend (more frequent repairs). A tie does not count as a reversal. EXERCISE 13.5 What is the number of reversals in the sequence 100, 200, 150, 50, 100, and 125?

The calculation of reversal probabilities is fairly simple. The probability of a reversal is found by first determining the permutations of n quantitative objects and then counting the number of reversals for each permutation. For n variables, there are n! permutations possible (see Chapter 11). The minimum number of reversals is 0, and the maximum number is the series

$$(n-1) + (n-2) + \cdots + 1 = \sum_{i=1}^{n-1}(n-i) = \frac{n(n-1)}{2}$$


Consider the case of n = 4. Designate the sequence of times between failures as the four observations X1, X2, X3, X4. There are 4! = 24 possible permutations. The maximum number of reversals is 4(4 − 1)/2 = 6. We can easily show (see Exercise 13.7) that the probabilities of 0, 1, 2, 3, 4, 5, and 6 reversals occurring by chance are (1/24), (3/24), (5/24), (6/24), (5/24), (3/24), and (1/24), respectively. The cumulative probabilities are thus .042, .167, .375, .625, .833, .958, and 1.000, respectively. The derivation of these probabilities is straightforward. The sequence X1 < X2 < X3 < X4 is the only permutation of (X1, X2, X3, X4) that has six reversals. For example, the sequence (5, 10, 15, 20) has six reversals. By the same argument (or by symmetry), the strictly decreasing sequence is the only permutation with no reversals, so the probability of no reversals occurring by chance is 1/24 = .042. There are three possible permutations that give one reversal (i.e., X4 < X3 < X1 < X2, X3 < X4 < X2 < X1, and X4 < X2 < X3 < X1). For example, the sequences (15, 20, 10, 5), (20, 15, 5, 10), and (20, 10, 15, 5) each contain one reversal. So the probability of one reversal is 3/24 = .125. Consequently, there are four permutations that give either zero or one reversal, for a cumulative probability of 4/24 = .167, and so on. A short sketch that enumerates these probabilities appears after the exercises below. EXERCISE 13.6 Consider the situation of three interarrival times. How many permutations of three objects exist? List each permutation and count the number of possible reversals for each. Determine the individual and cumulative probabilities for all possible reversals. EXERCISE 13.7 Consider the case of four interarrival times. How many permutations of four objects exist? List each permutation and count the number of possible reversals for each. Determine the individual and cumulative probabilities for all possible reversals. EXERCISE 13.8 Consider the case of five interarrival times. How many permutations of five objects exist? List each permutation and count the number of possible reversals for each. Determine the individual and cumulative probabilities for all possible reversals.
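For readers who prefer to verify the n = 4 probabilities by brute force, the following Python sketch (our own illustration, not the book's approach) enumerates all 24 permutations and tallies the reversal counts quoted above.

```python
# Enumerate reversal probabilities for n = 4 interarrival times.
from itertools import permutations
from collections import Counter

def reversals(seq):
    # count pairs (i, j), i < j, with seq[i] < seq[j]
    return sum(1 for i in range(len(seq)) for j in range(i + 1, len(seq))
               if seq[i] < seq[j])

counts = Counter(reversals(p) for p in permutations([1, 2, 3, 4]))
total = sum(counts.values())                      # 24 permutations
for r in sorted(counts):
    print(f"R = {r}: probability {counts[r]}/{total} = {counts[r] / total:.3f}")
```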

The RAT uses the total number of reversals in a sequence of interarrival times, comparing that number to what would be likely to occur based on pure chance. If the number of reversals is too small against some probability criteria, randomness is discredited and a downward trend is suggested. A test for an upward trend can similarly be made by testing whether the number of reversals is too large against the upper tail probability. Tables that show the probability of a given number of reversals in n items occurring by chance are available. (See De Le Mare 1991, which provides tables that correct some errors in Mann’s tables.) Kendall (1938) showed that the expected number of reversals is n(n − 1)/4 and the variance of the distribution of reversals is equal to (2n + 5)(n − 1)n/72. He also demonstrated that the distribution of the number of reversals R rapidly approaches the normal distribution with increasing n. Hence, the expression

$$Z = \frac{R + \frac{1}{2} - \frac{n(n-1)}{4}}{\sqrt{\frac{(2n+5)(n-1)n}{72}}}$$

quickly approaches the standard normal distribution. (The 1/2 term is added so that the normal distribution will better approximate the probabilities for the discrete distribution of R.) Thus, in addition to using tables to determine the probabilities, we can use the normal approximation for even fairly moderate values of n. For example, Table 13.1 compares the exact probabilities versus the normal approximation for n = 4.


TABLE 13.1
Probability of R Reversals by Chance for n = 4

R                       0       1       2       3       4       5       6
Exact                   4.2%    12.5%   20.8%   25.0%   20.8%   12.5%   4.2%
Normal Approximation    4.5%    10.9%   21.3%   26.6%   21.3%   10.9%   3.6%

TABLE 13.2
Critical Values Rn,alpha% of the Number of Reversals for the Reverse Arrangement Test

                 Single-Sided Lower Significance Level     Single-Sided Upper Significance Level
                 (Too Few Reversals Provide Evidence       (Too Many Reversals Provide Evidence
                 of Degradation)                           of Improvement)
Sample Size n    1%      5%      10%                       10%     5%      1%
4                        0       0                         6       6
5                0       1       1                         9       9       10
6                1       2       3                         12      13      14
7                2       4       5                         16      17      19
8                4       6       8                         20      22      24
9                6       9       11                        25      27      30
10               9       12      14                        31      33      36
11               12      16      18                        37      39      43
12               16      20      23                        43      46      50

Instead of calculating probabilities for all possible reversals, one is often interested in determining if an observed number of reversals is beyond random chance occurrence, indicating a possible trend toward decreasing or increasing interarrival times. Is the number of reversals statistically significant at some critical level? Table 13.2 provides the critical values for one-sided, upper and lower, statistical significance as a function of the observed sample size n. For example, if we collect a sequence of 10 interarrival times and observe 12 or fewer reversals, we would conclude that there is evidence of significant decreasing (degradation) of times at a one-sided significance level of 5%. Similarly, if 33 or more reversals are observed, we would suspect increasing times (improvement) at the one-sided 5% significance level. Note that the results above are symmetric for any row: subtracting the lower critical value from the maximum number of reversals gives the upper critical value for any matching level of significance α, that is,

$$R_{n,\,\text{upper}\ \alpha\%} = \frac{n(n-1)}{2} - R_{n,\,\text{lower}\ \alpha\%}$$

For n greater than 12, critical values for the number of reversals can be found by solving inversely for the upper R value in the Kendall normal approximation equation, that is,

$$R = z_{\text{critical}}\sqrt{\frac{(2n+5)(n-1)n}{72}} + \frac{n(n-1)}{4} - \frac{1}{2}$$


For example, for n = 20, at an upper 5% significance level ( zcritical = 1.645), the critical value is roughly 120. The maximum number of reversals is 20(19)/2 = 190. The lower 5% critical value is 190 − 120 = 70. EXAMPLE 13.2  REVERSE ARRANGEMENT TEST A system experiences repairs at the following ages, in hours: 155, 335, 443, 583, 718, 815, 925, 1030, 1113, 1213, 1341, 1471, 1551, 1633, and 1748. Is there any statistically significant indication of a trend to shorter times between repairs?

Solution It is always a good idea to plot the data. Figure 13.13 is a cumulative plot of the data. Although there does appear to be upward curvature in the plot (a trend to shorter times), it is difficult to assess the statistical significance of the trend, if any, by viewing only the chart. We construct a second plot (Figure 13.14) of the 15 observed interarrival repair times Xi, where i = 1, …, 15, versus the system age at repair. These times are 155, 180, 108, 140, 135, 97, 110, 105, 83, 100, 128, 130, 80, 82, and 115. This plot is more revealing of the apparent decreasing trend, but again, what is the likelihood of this data sequence occurring by chance? EXERCISE 13.9 Make an ARR plot (described in Section 13.3.1) using the reciprocal interarrival times of Example 13.2 versus the system age. Comment on the results. Applying the RAT for randomness, we find 30 reversals in the sequence of Xi. Is this value a statistically significant indication of a downward trend (i.e., decreasing times)? For n = 15, the expected number of reversals is 15(14)/4 = 52.5 and the variance is 102.08. The probability of 30 or fewer reversals is obtained by calculating z = (30 + 0.5 − 52.5)/10.1 = −2.18. The corresponding lower-tail probability obtained from a standard normal table is .0147, or less than 1.5%. Thus, the low reversal count is a rare occurrence under chance alone. Since the number of reversals is fewer than expected, a trend toward shorter times between failures is suspected. In other words, the frequency of failures appears to be increasing.

FIGURE 13.13 Cumulative plot of repair data.


FIGURE 13.14 Interarrival times versus system age.
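The Example 13.2 calculation is easy to reproduce in a few lines of code. The sketch below (a Python illustration of our own; the book itself works with tables and spreadsheets) counts reversals and applies Kendall's normal approximation with the 1/2 continuity correction.

```python
# Reverse arrangement test via the normal approximation (Example 13.2 data).
from math import sqrt, erf

def count_reversals(x):
    return sum(1 for i in range(len(x)) for j in range(i + 1, len(x))
               if x[i] < x[j])

def rat_normal(x):
    n, R = len(x), count_reversals(x)
    mean = n * (n - 1) / 4.0
    var = (2 * n + 5) * (n - 1) * n / 72.0
    z = (R + 0.5 - mean) / sqrt(var)              # continuity-corrected
    p_lower = 0.5 * (1.0 + erf(z / sqrt(2.0)))    # P(R or fewer reversals)
    return R, z, p_lower

times = [155, 180, 108, 140, 135, 97, 110, 105, 83, 100, 128, 130, 80, 82, 115]
R, z, p = rat_normal(times)
print(f"R = {R}, z = {z:.2f}, P(R <= {R}) = {p:.4f}")   # 30, about -2.18, about 0.015
```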

EXERCISE 13.10 Apply the RAT individually to the three sets of repair data of Figures 12.2, 13.2, and 13.5, respectively (n = 10 for each set). What is the probability of a more extreme number of reversals occurring by chance alone for each set? Compare to the results in Exercise 13.3. EXERCISE 13.11 There are 25 interarrival times observed. The number of reversals is 98. Calculate the expected number of reversals. Is there statistical evidence of a trend? Is the trend toward shorter or longer interarrival times? Find the level of significance of any trend. EXERCISE 13.12 Determine the upper and lower critical values of reversals for the RAT at the 5% significance level for 15 observed interarrival times.

13.6  Combining Data from Several Tests If we have data on more than one system, applying the RAT to each system may be insufficient to detect a trend, but taken together, the systems might provide overall evidence of a trend. Alternatively, to combine data from all systems into one set of data, one may work with the mean cumulative repair function as previously described by Nelson (1988a), which has the added advantage of nonparametrically handling multicensored data. Let us consider a procedure for combining data from a number of systems in order to get a more sensitive test for detecting trend. The procedure is from Fisher (1954) and is described by De Le Mare (1991).


Suppose that none of three independent tests is individually significant at a prescribed level (say 5%). Fisher describes the following procedure: Take the natural logarithm of each significance level, change its sign, and double it. Values so obtained are χ2 distributed with 2 degrees of freedom. Any number of these quantities may be added together for a composite test. We illustrate the method with an example. EXAMPLE 13.3  FISHER'S COMPOSITE TEST From Example 13.1, three repairable systems have experienced the following consecutive interarrival times:
System 1: 98, 150, 37, 62, 15
System 2: 15, 77, 96, 14, 12
System 3: 55, 80, 48, 37
We use the RAT method and calculate that the numbers of reversals for the three systems are 2, 3, and 1, with respective tail probabilities of chance occurrence equal to 14/120 = .117, 29/120 = .242, and 4/24 = .167 (see Exercises 13.7 and 13.8 for probability calculations). Alternatively, the normal approximation gives tail probabilities of .11, .23, and .15, respectively. To test whether the aggregate should be regarded as significant, we perform the numerical steps shown in Table 13.3. For 6 degrees of freedom, we have χ2(6; 0.10) = 10.6, indicating a significance level less than 10% for the test on these data (which yielded a test statistic of 10.71). Alternatively, we can use the spreadsheet function CHIDIST(10.71,6) to return the result 9.8% for the significance level. (The normal approximation gives a significance level of 8.6%.) Thus, individually, there was insufficient data to conclude significance, but the composite test provides significance at the 10% level for the aggregate. There is some indication of a decreasing trend (i.e., shorter times between failures). Compare these results to the Laplace test conclusions in Example 13.1.
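The composite calculation is a one-liner once the individual tail probabilities are in hand. The sketch below (assuming SciPy is available; it is not the spreadsheet CHIDIST route used in the example) reproduces the Example 13.3 aggregate significance level.

```python
# Fisher's composite test: sum -2 ln p over tests, compare to chi-square with 2k df.
from math import log
from scipy.stats import chi2

p_values = [0.117, 0.242, 0.167]                     # RAT tail probabilities, Example 13.3
statistic = sum(-2.0 * log(p) for p in p_values)     # about 10.71
df = 2 * len(p_values)                               # 6 degrees of freedom
print(f"chi-square statistic = {statistic:.2f} on {df} df")
print(f"composite significance level = {chi2.sf(statistic, df):.3f}")   # about 0.098
```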

Note that the composite test procedure does not assume equivalent repair rates (or even, for two-sided p values, the same direction of trend) within the different systems; the procedure looks only at whether the individual sequences taken collectively exhibit any trend. To compare rates, a nonparametric procedure such as the Wilcoxon rank sum test for two populations or the Kruskal–Wallis test for three or more populations (see Ott and Longnecker 2001) may be applied to the cumulative repair times. EXERCISE 13.13 The repair histories (ages in hours) on four failure-censored systems are shown below.
System 1: 142, 309, 460
System 2: 99, 145, 300, 347
System 3: 212, 225, 273, 398, 467
System 4: 21, 58, 150, 176

TABLE 13.3
Steps for Fisher's Composite Test

P        −2 ln P    Degrees of Freedom
0.117    4.29       2
0.242    2.84       2
0.167    3.58       2
SUM      10.71      6


Make a graph containing the cumulative plot for each system along with the mean cumulative repair function estimate (pooled summation approach). Using the RAT procedure, calculate the probabilities of the observed reversals for each system. Perform Fisher’s composite test on the aggregate. Do you believe all four systems have basically the same repair rate?

13.7  Nonhomogeneous Poisson Processes NHPP is a useful generalization of the Poisson process, which has wide applicability in the modeling of repairable systems (for examples, see Cox and Lewis 1966; Crow 1974, 1990, 1993; Lawless 2003; Bain and Engelhardt 1991; Rigdon and Basu 1989). The NHPP is also used in the modeling of software reliability (see Xie 1991; Musa, Iannino, and Okumoto 1987). Recall that the HPP has stationary increments, that is, the number of events that occur in any interval of time depends only on the length of the interval and not on the starting point of the interval. The interarrival times for an HPP are independent with the same exponential distribution, and the process has no memory. The NHPP permits nonstationary events by allowing the repair or recurrence rate λ at time t to be a function of age t. When the recurrence rate is a function of time, it is called the intensity function λ(t). Define the mean cumulative function

$$M(t) = \int_0^t \lambda(\tau)\,d\tau$$

Then, it can be shown (see Ross 2003) that

$$P\left[N(t+s) - N(t) = n\right] = e^{-[M(t+s)-M(t)]}\,\frac{\left[M(t+s)-M(t)\right]^n}{n!}$$

This equation states that the incremental number of occurrences, N (t + s) − N (t), in the interval of time from t to t + s has a Poisson distribution with a mean (or expected) number of occurrences equal to M(t + s) − M(t). Consequently, M(t) is called the mean value or mean repair function. If the intensity function is constant, that is, λ(t) = λ, then we have the HPP with M(t) = λt, and consequently, N (t + s) − N (t) is Poisson distributed with mean λ s. From the above expression, we see also that the reliability R( s), defined as the probability of zero occurrences in the time interval t to t + s, is

$$R(s) = e^{-[M(t+s)-M(t)]}$$

EXAMPLE 13.4  NONHOMOGENEOUS POISSON PROCESS Consider an NHPP with mean repair function modeled by the power relation

$$M(t) = 0.01\,t^{1.5}$$

What is the expected number of repairs between 80 and 100 hours, and between 180 and 200 hours? What is the reliability between 80 and 100 hours, and between 180 and 200 hours? What is the probability of at least two repairs between 80 and 100 hours, and between 180 and 200 hours?


Solution
Note M(80) = 0.01(80)^1.5 = 7.16, M(100) = 0.01(100)^1.5 = 10.00, M(180) = 0.01(180)^1.5 = 24.15, and M(200) = 0.01(200)^1.5 = 28.28. Hence, the expected number of repairs between 80 and 100 hours is M(100) − M(80) = 10.00 − 7.16 = 2.84, and between 180 and 200 hours it is M(200) − M(180) = 28.28 − 24.15 = 4.13. Since the number of repairs between 80 and 100 hours is Poisson with mean 2.84, the reliability is the probability that the number of repairs is zero, or

$$P[N(100) - N(80) = 0] = e^{-[M(100)-M(80)]} = e^{-2.84} = 0.0582$$

Alternatively, we can use the spreadsheet function POISSON(0,2.84,0). Similarly, the reliability between 180 and 200 hours is

$$P[N(200) - N(180) = 0] = e^{-[M(200)-M(180)]} = e^{-4.13} = 0.0161$$

The spreadsheet function POISSON(0,4.13,0) returns the same result. The probability of at least two repairs between 80 and 100 hours is 1 − {P[0] + P[1]}, that is,

$$P[N(100) - N(80) \ge 2] = 1 - P[N(100) - N(80) \le 1] = 1 - \left[e^{-2.84} + 2.84\,e^{-2.84}\right] = 0.776$$

The spreadsheet function 1-POISSON(1,2.84,1) will also yield the same outcome. Similarly, between 180 and 200 hours, the probability of at least two repairs is

$$P[N(200) - N(180) \ge 2] = 1 - P[N(200) - N(180) \le 1] = 1 - \left[e^{-4.13} + 4.13\,e^{-4.13}\right] = 0.917$$

Again, the spreadsheet function 1-POISSON(1,4.13,1) may be used. EXERCISE 13.14 For the model in Example 13.4, find the expected number of repairs and the reliability between 280 and 300 hours. Find the probability of at least two repairs between 280 and 300 hours.
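The Example 13.4 quantities can also be computed directly from the Poisson probabilities, as in the following minimal Python sketch (our own illustration; the spreadsheet POISSON calls above do the same job). It may also serve as a convenient starting point for Exercise 13.14.

```python
# NHPP with mean repair function M(t) = 0.01 * t**1.5 (Example 13.4).
from math import exp

def M(t, a=0.01, b=1.5):
    return a * t**b

def interval_summary(t1, t2):
    mu = M(t2) - M(t1)                        # expected repairs in (t1, t2]
    reliability = exp(-mu)                    # P(no repairs)
    p_two_or_more = 1.0 - exp(-mu) * (1.0 + mu)
    return mu, reliability, p_two_or_more

for t1, t2 in [(80, 100), (180, 200)]:
    mu, rel, p2 = interval_summary(t1, t2)
    print(f"{t1}-{t2} h: expected repairs {mu:.2f}, "
          f"reliability {rel:.4f}, P(>=2 repairs) {p2:.3f}")
```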

13.8  Models for the Intensity Function of an NHPP What are some possible models for the intensity function of an NHPP? We consider two commonly applied, time-dependent models for the mean repair function. 13.8.1  Power Relation Model The power relation model may be written as

$$M(t) = a\,t^{b}$$

Its intensity function is

$$\lambda(t) = \frac{dM(t)}{dt} = a b\,t^{b-1}$$


Note that if 0 < b < 1, the intensity function is decreasing, that is, the ROCOF is improving. If b > 1, as in Example 13.4, the rate is increasing. We note the similarity of the intensity function to the formula for the hazard rate of the Weibull distribution. For this reason, such processes have been inappropriately and confusingly called Weibull processes in the reliability literature. However, we caution against using this terminology because the power relation process is not based on the Weibull distribution and procedures applicable to the analysis of Weibull data are not correct here. For an NHPP with intensity rate modeled by the power relation, how do we estimate the model parameters? Earlier in this chapter, we suggested a graphical technique using rectification and an eyeball fit or possibly least squares regression (only as an approximation). Crow (1974) has developed maximum likelihood estimates (MLEs) for the power model. MLEs exist for two different forms of data truncation: by failure count or by fixed time. Suppose a single system experiences n repairs at system ages ti , i = 1, 2, … , n. If the data are censored at the nth failure, the number of repairs is fixed at n, but the time tn to the nth failure is random. For this failure-censored situation, conditioned on the system age tn , the modified MLEs for the power model (modified to provide unbiased bˆ ) are

$$\hat{b} = \frac{n-2}{\sum_{i=1}^{n-1}\ln\!\left(\frac{t_n}{t_i}\right)}, \qquad \hat{a} = \frac{n}{t_n^{\hat{b}}}$$

If, instead of failure-censored data, we have data censored at a fixed time T (consequently, the number of failures N by time T is random), then, conditioned on the number of repairs N, the time-censored modified MLEs (unbiased b̂) for the power model are

$$\hat{b} = \frac{N-1}{\sum_{i=1}^{N}\ln\!\left(\frac{T}{t_i}\right)}, \qquad \hat{a} = \frac{N}{T^{\hat{b}}}$$

If we have time- or failure-censored data on k copies of the system operating under similar conditions, the data may be combined to improve the estimates of the model parameters. Let Tq denote the censoring time for the qth system, q = 1, 2, … , k. Let nq denote the total number of failures on the qth system by time Tq. Let tiq denote the system age for the ith repair on the qth system. In order to combine both time- and failure-censored data, we introduce a new variable N q, which equals nq if the data on the qth system are time censored or equals nq  −1 if the data on the qth system are failure censored. Crow (1974) shows that, conditioned on either the number of repairs for each system or on the censoring time for each system, the unbiased modified MLE for b can be expressed in closed form as

$$\hat{b} = \frac{N_s - 1}{\sum_{q=1}^{k}\sum_{i=1}^{N_q}\ln\!\left(\frac{T_q}{t_{iq}}\right)}$$

where

$$N_s = \sum_{q=1}^{k} N_q$$


The modified MLE (pooled) for a (not necessarily unbiased) is

$$\hat{a} = \frac{\sum_{q=1}^{k} n_q}{\sum_{q=1}^{k} T_q^{\hat{b}}}$$

We have estimated the parameters a and b as common to all systems, that is, the systems are samples from a population having a single mean cumulative repair function M(t). However, several other distinct possibilities may exist. For example, the parameter b may be common to all systems, but the parameter a may be different for each system. In that case, a would be estimated separately for each system using

$$\hat{a}_q = \frac{n_q}{T_q^{\hat{b}}}$$

where b̂ is the common estimate. Alternatively, we may wish to test the hypothesis that the parameter b is the same for each system. We may desire confidence intervals on the parameter a or b. (Confidence intervals involving the power law relation are discussed in Section 13.10.) There may be other censoring situations (e.g., systems with left-censored data and different censoring times). Thus, there are many possible tests depending on the available data or desired objectives. For further considerations and examples, consult Crow (1974, 1990). For confidence intervals on the intensity function for the power model, see Crow (1993). For confidence intervals on the mean cumulative repair function, see Crow (1982). To assess the suitability of the power model for a set of data (i.e., to test the hypothesis that the failure times of any number of systems follow an NHPP with a power law intensity rate), Crow (1974) derived a test statistic C²Ns and tabulated the critical values. The test is based on transforming the repair times for each system. Under time censoring, divide each repair time by the corresponding system censoring time; under failure censoring, divide each repair time except the last one by the corresponding time of the last repair. Consequently, the total number of transformed values will equal Ns, defined previously. These values are combined and ordered from smallest to largest. Call these ordered numbers Zi, i = 1, ..., Ns. The goodness-of-fit statistic is

$$C_{N_s}^2 = \frac{1}{12 N_s} + \sum_{i=1}^{N_s}\left[Z_i^{\hat{b}} - \frac{2i-1}{2 N_s}\right]^2$$

If the test statistic is greater than the critical value of Table 13.4 (from Crow 1990), then we reject the NHPP with power relation intensity function as an appropriate model for the data. Note that if we use b = 1 in the goodness-of-fit statistic, we are testing whether the failure times follow an HPP model. For Ns > 100, use values for Ns = 100. EXAMPLE 13.5  NHPP WITH POWER RELATION INTENSITY Table 13.5 contains the repair history for three simulated systems. Estimate the parameters for the power law model, assuming the systems can be treated as equivalent copies. Perform a goodness-of-fit test using the C²Ns test statistic.


TABLE 13.4
Critical Values for Goodness-of-Fit Test

          Significance Level
Ns        .20      .15      .10      .05      .01
2         .138     .149     .162     .175     .186
3         .121     .135     .154     .184     .23
4         .121     .134     .155     .191     .28
5         .121     .137     .160     .199     .30
6         .123     .139     .162     .204     .31
7         .124     .140     .165     .208     .32
8         .124     .141     .165     .210     .32
9         .124     .142     .167     .212     .32
10        .125     .142     .167     .212     .32
11        .126     .143     .169     .214     .32
12        .126     .144     .169     .214     .32
13        .126     .144     .169     .214     .33
14        .126     .144     .169     .214     .33
15        .126     .144     .169     .215     .33
16        .127     .145     .171     .216     .33
17        .127     .145     .171     .217     .33
18        .127     .146     .171     .217     .33
19        .127     .146     .171     .217     .33
20        .128     .146     .172     .217     .33
30        .128     .146     .172     .218     .33
60        .128     .147     .173     .220     .33
100       .129     .147     .173     .220     .34

Source: Crow, L. H. 1990. Evaluating the reliability of repairable systems. In Proceedings Annual Reliability and Maintainability Symposium, 275–9.

TABLE 13.5
Repair History in Hours (Simulated Data: a = 0.25 and b = 0.50)

Repair Number     System 1    System 2    System 3
1                 15.1        15.2        11.3
2                 47.1        122.7       43.3
3                 51.1        172.1       122.8
4                 158.7       288.1       203.3
5                 221.2       371.9       294.0
6                 495.5       376.2       468.0
7                 769.2       567.2       800.7
8                 965.1       797.7       987.4
9                             981.8
Censoring Times   1000        981.8       1000
Censoring Type    Time        Failure     Time


Solution
The calculations are as follows:

System 1: N1 = n1 = 8, and T1 = 1000

$$\sum_{i=1}^{N_1}\ln\frac{T_1}{t_{i1}} = \sum_{i=1}^{8}\ln\frac{1000}{t_{i1}} = \ln\frac{1000}{15.1} + \cdots + \ln\frac{1000}{965.1} = 14.57$$

System 2: N2 = n2 − 1 = 9 − 1 = 8, and T2 = t_n2 = 981.8

$$\sum_{i=1}^{N_2}\ln\frac{T_2}{t_{i2}} = \sum_{i=1}^{8}\ln\frac{981.8}{t_{i2}} = \ln\frac{981.8}{15.2} + \cdots + \ln\frac{981.8}{797.7} = 11.90$$

System 3: N3 = n3 = 8, and T3 = 1000

$$\sum_{i=1}^{N_3}\ln\frac{T_3}{t_{i3}} = \sum_{i=1}^{8}\ln\frac{1000}{t_{i3}} = \ln\frac{1000}{11.3} + \cdots + \ln\frac{1000}{987.4} = 13.53$$

Since Ns = 8 + 8 + 8 = 24, the unbiased modified MLE for b is

$$\hat{b} = (24 - 1)/(14.57 + 11.90 + 13.53) = 0.575$$

The parameter a is estimated using the total repair count n1 + n2 + n3 = 25:

$$\hat{a} = 25/(1000^{0.575} + 981.8^{0.575} + 1000^{0.575}) = 0.158$$

The model for the mean cumulative repair function per system is

$$M(t) = 0.158\,t^{0.575}$$

To check the goodness of fit, we transform the repair times by dividing the repair ages for systems 1 and 3 by 1000, the censoring time, and for system 2 by the last repair time of 981.8 (Table 13.6). We now order the transformed times to get the Z values: Z1 = 0.0113, Z2 = 0.0151, …, Z24 = 0.9874. Substituting these values into the goodness-of-fit statistic with b̂ = 0.575 and Ns = 24, we get

$$C_{N_s}^2 = \frac{1}{12(24)} + \left[0.0113^{0.575} - \frac{1}{48}\right]^2 + \cdots + \left[0.9874^{0.575} - \frac{47}{48}\right]^2 = 0.0403$$

For a test at the 5% significance level, Table 13.4 shows a critical value of 0.217. Since 0.0403 is less than 0.217, we accept the NHPP power model.
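For readers who want to script the Example 13.5 calculations rather than work through them by hand, the following Python sketch (our own layout of the Table 13.5 data, not JMP or Minitab output) computes the pooled modified MLEs and the goodness-of-fit statistic.

```python
# Crow's pooled modified MLEs and goodness-of-fit statistic for the power model.
from math import log

# (repair ages, censoring time, censoring type) per system, from Table 13.5
systems = [
    ([15.1, 47.1, 51.1, 158.7, 221.2, 495.5, 769.2, 965.1], 1000.0, "time"),
    ([15.2, 122.7, 172.1, 288.1, 371.9, 376.2, 567.2, 797.7, 981.8], 981.8, "failure"),
    ([11.3, 43.3, 122.8, 203.3, 294.0, 468.0, 800.7, 987.4], 1000.0, "time"),
]

log_sum, N_s, n_total, z = 0.0, 0, 0, []
for ages, T, ctype in systems:
    use = ages if ctype == "time" else ages[:-1]   # drop last repair if failure censored
    log_sum += sum(log(T / t) for t in use)
    N_s += len(use)
    n_total += len(ages)
    z += [t / T for t in use]                      # transformed times for the fit test

b_hat = (N_s - 1) / log_sum                               # unbiased modified MLE
a_hat = n_total / sum(T**b_hat for _, T, _ in systems)    # pooled estimate of a
z.sort()
C2 = 1.0 / (12 * N_s) + sum((zi**b_hat - (2 * i - 1) / (2.0 * N_s))**2
                            for i, zi in enumerate(z, start=1))
print(b_hat, a_hat, C2)   # compare with 0.575, 0.158, and 0.0403
```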


TABLE 13.6
Transformed Repair Times

System 1    System 2    System 3
0.0151      0.0155      0.0113
0.0471      0.1250      0.0433
0.0511      0.1753      0.1228
0.1587      0.2934      0.2033
0.2212      0.3788      0.2940
0.4955      0.3832      0.4680
0.7692      0.5777      0.8007
0.9651      0.8125      0.9874

Solution Using Excel Solver It is important to compare the fit of the model to the data. Figure 13.15 shows the cumulative plots for the three systems and the estimated model. We see that the MLEs have some bias in that the model appears to underestimate the behavior of the individual systems. As was described in Chapter 6, it may be possible to improve the model fit through the use of Excel's SOLVER program applied to the sum of squared residuals. The residual is the difference between the model fit and the data at the observed repair times. Referring to Figure 13.16, we can use the MLEs for the model parameters â and b̂ as starting values, entered in the two cells adjacent to the labels "b est" and "a est." On the left side of Figure 13.16, we have four columns: one for the repair number and three columns for the model predictions based on the MLEs in the two cells, evaluated at the repair ages for each system. The second set of three columns consists of the residuals. We then use the spreadsheet function SUMSQ(residuals), which squares and sums the residuals, to obtain the value 12.56 using the MLE parameter estimates. Then, we use the SOLVER routine to minimize the target cell with the SUMSQ formula by changing the values in the two parameter cells (Figure 13.17). We see that the SUMSQ(residuals) is now 6.42, indicating an improved model fit to the data. Also, a cumulative plot and the SOLVER model (Figure 13.18) confirm the improvement in the model fit to the data. Since the data were originally simulated by using the parameters b = 0.50 and a = 0.25, it is interesting to note the closer agreement of the actual parameter values with the SOLVER results (0.468, 0.331) compared to the MLEs (0.575, 0.158).
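An equivalent least squares fit can be carried out outside a spreadsheet. The sketch below (assuming NumPy and SciPy; it mimics, but is not, the SOLVER setup of Figures 13.16 and 13.17) minimizes the same sum of squared residuals starting from the MLE values.

```python
# Minimize the sum of squared residuals between M(t) = a*t**b and the
# observed cumulative repair counts of Table 13.5.
import numpy as np
from scipy.optimize import minimize

ages = {
    1: [15.1, 47.1, 51.1, 158.7, 221.2, 495.5, 769.2, 965.1],
    2: [15.2, 122.7, 172.1, 288.1, 371.9, 376.2, 567.2, 797.7, 981.8],
    3: [11.3, 43.3, 122.8, 203.3, 294.0, 468.0, 800.7, 987.4],
}

def sum_sq(params):
    a, b = params
    total = 0.0
    for t_list in ages.values():
        counts = np.arange(1, len(t_list) + 1)        # observed cumulative repairs
        total += np.sum((a * np.array(t_list)**b - counts)**2)
    return total

result = minimize(sum_sq, x0=[0.158, 0.575], method="Nelder-Mead")
a_fit, b_fit = result.x
print(a_fit, b_fit, sum_sq(result.x))   # roughly 0.33, 0.47, and 6.4
```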

EXERCISE 13.15 Construct a cumulative plot based on data for the three systems in Table 13.5. For the observed (combined) repair times, estimate the mean cumulative repair function and add to the previous plot. Comment on the results.

EXERCISE 13.16 Apply the MLE formulas modified by Crow (1974) to estimate the parameters a and b for the data of Figures 13.2 and 13.5. Plot the fitted power relationship on the cumulative plots.

EXERCISE 13.17 Apply the Excel SOLVER routine to estimate the parameters a and b for the data of Figures 13.2 and 13.5. Plot the fitted power relationship on the cumulative plots. Compare to the results of Exercise 13.16.


FIGURE 13.15 Cumulative plot of MLE model fit to system data.

FIGURE 13.16 Spreadsheet setup for applying SOLVER routine (MLE parameters): model fit and residuals by repair number for each system, with SUMSQ(residuals) = 12.56 at the MLE starting values (b est = 0.575, a est = 0.158).

FIGURE 13.17 Spreadsheet showing SOLVER results: SUMSQ(residuals) = 6.42 at the SOLVER solution (b est = 0.468, a est = 0.331).


FIGURE 13.18 Cumulative plot of SOLVER model fit to system data.

13.8.2  Exponential Model Consider an NHPP for which the intensity function is

$$\lambda(t) = e^{c + bt} = a\,e^{bt}$$

where b and c are empirically determined parameters and a = e^c. The mean repair function is found by simple integration:

$$M(t) = \int_0^t \lambda(\tau)\,d\tau = \frac{e^c}{b}\left(e^{bt} - 1\right) = \frac{a}{b}\left(e^{bt} - 1\right)$$

For a single system having data censored at the nth failure (failure censored), for which the time tn to the nth repair is random, Lawless (2003) provides the MLEs b̂ and ĉ. To find b̂, we need to substitute trial values for b̂ into the following equation until a solution occurs:

$$\sum_{i=1}^{n} t_i + \frac{n}{\hat{b}} - \frac{n\,t_n}{1 - e^{-\hat{b} t_n}} = 0$$

where ti, i = 1, ..., n are the chronological system ages at repair. Then, ĉ is estimated from the expression

$$\hat{c} = \ln\!\left(\frac{n\hat{b}}{e^{\hat{b} t_n} - 1}\right)$$


If the data are censored at fixed time T (time censored), the number of repairs N is random, and the MLE for b̂ is found by trial substitution in the equation

$$\sum_{i=1}^{N} t_i + \frac{N}{\hat{b}} - \frac{N T}{1 - e^{-\hat{b} T}} = 0$$

Then, we evaluate ĉ using the formula

$$\hat{c} = \ln\!\left(\frac{N\hat{b}}{e^{\hat{b} T} - 1}\right)$$

If b = 0 in the exponential model, then the intensity function λ(t) = e^c is constant, indicating a Poisson process (HPP). To test whether b = 0 and consequently the data are consistent with a Poisson process, we use a test statistic from Lawless (2003). Under the null hypothesis of an HPP, the test statistic for failure censoring is

$$U = \frac{\sum_{i=1}^{n-1} t_i - \frac{1}{2}(n-1)\,t_n}{t_n\sqrt{\frac{n-1}{12}}}$$

For time censoring at T, the test statistic from Cox and Lewis (1966) is

$$U = \frac{\sum_{i=1}^{N} t_i - \frac{1}{2} N T}{T\sqrt{\frac{N}{12}}}$$

In either situation, U is approximately N(0,1), that is, normally distributed with mean 0 and variance 1. We recognize this test as the Laplace test previously described. If we have data from several systems, the observations from all systems can be combined to provide an overall test of the null hypothesis. Under the hypothesis that each of k processes is HPP (allowing for different intensities), Cox and Lewis (1966) show, for failure censoring on each system, that the centroid test statistic

$$U = \frac{\sum_{l=1}^{k}\sum_{i=1}^{n_l - 1} t_{li} - \frac{1}{2}\sum_{l=1}^{k}(n_l - 1)\,t_{l n_l}}{\sqrt{\frac{1}{12}\sum_{l=1}^{k}(n_l - 1)\,t_{l n_l}^2}}$$

is approximately N(0,1). Note that n_l is the number of repairs for the lth process, t_li is the system age at the ith repair for the lth system, and t_ln_l is the last repair time (censoring point) for the lth system. For time-censored systems, replace each n_l − 1 in the above formula, including summation limits, with the number of repairs N_l, and replace each t_ln_l with the observation (censoring) age.


EXAMPLE 13.6  NHPP WITH EXPONENTIAL INTENSITY MODEL A system experiences 10 repairs at the following ages (simulated from an NHPP with b = 0.001 and a = 0.005):



59.0, 207.5, 284.9, 484.0, 552.8, 636.4, 826.0, 988.4, 1008.2, and 1070.1 The data stops at the 10th repair (failure censored). Estimate the parameters b and c for the exponential model. Estimate the intensity function and the mean repair function. Check the model fit using the Laplace test.

Solution
We need to solve the equation for b̂ iteratively by substituting trial values, where n = 10 and the sum of the repair ages is 6117.3. That is, we seek b̂ such that

$$6117.3 + \frac{10}{\hat{b}} - \frac{10(1070.1)}{1 - e^{-\hat{b}(1070.1)}} = 0$$

Using the SOLVER routine in an Excel spreadsheet (starting with trial b̂ = 0.01 and setting the target cell to 0), we get b̂ = 0.0008137. Then, â = 10(0.0008137)/(e^{0.0008137(1070.1)} − 1) = 0.005860, or ĉ = −5.140. The intensity function estimate is

$$\hat{\lambda}(t) = e^{-5.14 + 0.0008137 t} = 0.00586\,e^{0.0008137 t}$$

The mean cumulative repair function estimate is

$$\hat{M}(t) = 7.20\left(e^{0.0008137 t} - 1\right)$$



Under failure censoring, the U statistic for the Laplace test is

$$U = \frac{\sum_{i=1}^{n-1} t_i - \frac{1}{2}(n-1)\,t_n}{t_n\sqrt{\frac{n-1}{12}}} = \frac{5047.2 - (9/2)(1070.1)}{1070.1\sqrt{9/12}} = 0.250$$

Comparing U = 0.250 to the standard normal distribution shows that this result offers no evidence against the HPP. In fact, if we plot the actual data against the NHPP exponential model and the HPP model (Figure 13.19), we see that the HPP model, which is the simpler one, provides a better fit to the data. The power and the exponential models are only two of many possible relations for the intensity function of an NHPP. (For further discussion on these and other models, see Chapter 5 in Ascher and Feingold 1984.) EXERCISE 13.18 Using the results of Example 13.6, plot the mean cumulative repair function estimate and compare to the observed results by overlaying M̂(t) on the cumulative plot.
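The trial-and-error solution for b̂ in Example 13.6 can be automated with any root finder. The following Python sketch (assuming SciPy; not the spreadsheet SOLVER route used above) solves the failure-censored MLE equation and recomputes the Laplace statistic.

```python
# MLEs for the exponential intensity model, failure-censored data (Example 13.6).
from math import exp, log, sqrt
from scipy.optimize import brentq

ages = [59.0, 207.5, 284.9, 484.0, 552.8, 636.4, 826.0, 988.4, 1008.2, 1070.1]
n, t_n, t_sum = len(ages), ages[-1], sum(ages)

def score(b):          # MLE equation for b under failure censoring
    return t_sum + n / b - n * t_n / (1.0 - exp(-b * t_n))

b_hat = brentq(score, 1e-6, 0.01)                    # about 0.000814
c_hat = log(n * b_hat / (exp(b_hat * t_n) - 1.0))    # about -5.14
a_hat = exp(c_hat)                                   # about 0.00586

# Laplace (centroid) test against the HPP, failure censored
U = (sum(ages[:-1]) - 0.5 * (n - 1) * t_n) / (t_n * sqrt((n - 1) / 12.0))
print(b_hat, c_hat, a_hat, U)   # compare with 0.0008137, -5.14, 0.00586, 0.25
```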


FIGURE 13.19 Cumulative plot of HPP and NHPP models fit to system data.

EXERCISE 13.19 The following repair ages were recorded for a system: 212, 459, 704, 834, 953, and 1036. Observation occurred only to the sixth repair. Assuming an exponential model for the intensity λ(t), estimate the parameters and plot the model fit against the original data. Check the model fit using the Laplace test. EXERCISE 13.20 As was done in Example 13.5, apply the SOLVER routine to the data in Example 13.6 to determine parameters a and b. Compare the SOLVER model fit to the MLE model fit using the cumulative sum of the squared residuals. Plot the two models and the actual data in a single graph for a visual comparison. EXERCISE 13.21 In Example 13.6, compare the HPP model fit to the NHPP model fit using the cumulative sum of the squared residuals.

13.9  Rate of Occurrence of Failures Recall the derivative of the expected cumulative number of repairs in time:

$$m(t) = \frac{dM(t)}{dt}$$


Ascher and Feingold (1984) call m(t) the ROCOF. Note that the instantaneous repair rate m(t) is different from the observed average rate of occurrence of events defined as the total number of repairs N observed in an interval of length t divided by t, that is,

n = N/t

Alternatively, when the total number of repairs is k, we may calculate an average or mean time between failures (MTBF) as the total time T ( k ) divided by the cumulative number of repairs, that is,

$$\text{MTBF}_{\text{cum}} = \frac{T(k)}{k}$$

In studying reliability growth, that is, improvement in reliability as a product is improved, Duane (1964) discovered the empirical relationship that plotting MTBFcum versus time t on log–log paper often resulted in a nearly straight line. The slope of this line then served as a basis for assessing reliability growth. In Section 13.10, we present a discussion of reliability growth in further detail (see also O’Connor 2002, for more on this topic). EXERCISE 13.22 Take the three data sets (renewal, improving, and degrading) presented in Chapters 12 and 13 and construct Duane plots. Do the plots look linear? Note that plotting on log–log paper tends to linearize data plots. Considering this last statement, draw some conclusions about the Duane plots.

13.10  Reliability Growth Models Reliability growth can be an important consideration in the development of new products or systems. How can we model the improvement in reliability expected when we test a system during development and analyze all failures and take corrective actions? We previously discussed systems with improving repair rate within the general context of repairable systems. The models that will be described in this section have already been introduced. Here we will look at these models again from the perspective of tests or procedures aimed at improving system or equipment reliability prior to customer availability. Consider the following situation: A complex and expensive piece of equipment has been designed, developed, and prototyped, and is now undergoing reliability testing. An initial assessment of the MTBF is 10 hours. Since the primary customer has a minimum MTBF requirement of 100 hours, significant reliability improvement is needed. The equipment manufacturer decides to run a lengthy test, operating the equipment as the user would while paying careful attention to all failures. Every failure is analyzed until the root cause is discovered, and wherever possible, the equipment design or subcomponent parts selection is modified to eliminate or significantly reduce future failures of the type discovered. This kind of equipment improvement testing is sometimes given the acronym TAAF, for test, analyze, and fix. Another term used in industry for this type of testing is IRONMAN testing. Originally, IRONMAN was an acronym for improve reliability of new machines at night, referring to a practice where the equipment was tested during the day and modified


and improved during the night. Now, however, the term IRONMAN is still used regardless of when the sequence of testing and improving is carried out. After an extended TAAF program, it is natural to expect the equipment (or system) to operate at a higher level of reliability: the repair rate will be lower and the MTBF will be higher than it was at the start of the test. Consequently, models that estimate the rate of reliability growth are known as reliability growth models. An appropriate reliability growth model should be able to predict how much the MTBF will improve as the test time increases. Reliability growth models assume that once the improvement process is completed, the repair rate (or its reciprocal, the MTBF) is constant. Using the terminology described in Chapter 12, a constant repair rate implies an HPP. The goal of the reliability improvement testing program is to test long enough to increase the operational MTBF so that it satisfies customer requirements. Duane (1964) studied data from several different products, all of which had been through reliability improvement (TAAF) testing. He noted that he consistently obtained good linear relationships when he plotted a cumulative estimate of the MTBF versus system age (or cumulative operating time) using log–log graph paper. The cumulative estimate of the MTBF at the system age tk of the kth failure is defined as

$$\text{MTBF}_{\text{cum}}(t_k) = \frac{t_k}{k}$$



This MTBF is a cumulative estimate because it uses all the failures from the beginning of the test just as if the failure rate had been constant throughout. Duane’s empirical relationship states that, for any total time t, ln MTBFcum (t) = α + β ln(t). This relationship can be expressed as a power law model

$$\text{MTBF}_{\text{cum}}(t) = e^{\alpha}\,t^{\beta}$$

Here we use natural logarithms. If base 10 logarithms are used, β is unchanged and α is replaced by α/ln 10. Using notation from Chapter 12, let M(t) be the average number of repairs up to system age t. Then, we can write the definition of the cumulative MTBF as MTBFcum(t) = t/M(t), and the Duane relationship is equivalent to M(t) = at b



where a = e^{−α} and b = 1 − β. This expression is just the power relationship model (sometimes misleadingly called a "Weibull process") discussed previously. The intensity function, or instantaneous repair rate for this process, is the derivative m(t) = M′(t) = ab t^{b−1}. The MTBF at the time we stop the test is just the reciprocal of the instantaneous repair rate evaluated at that time. (Remember that the MTBF becomes constant when the test ends.) If this instantaneous MTBF is denoted by MTBFI(t), it follows that

$$\text{MTBF}_I(t) = \frac{1}{m(t)} = \frac{1}{a b\,t^{b-1}} = \frac{\text{MTBF}_{\text{cum}}(t)}{b} = \frac{\text{MTBF}_{\text{cum}}(t)}{1 - \beta}$$

If we have estimates for a and b (or α and β), we can use the last equation to estimate the MTBF = MTBFI at the end of test. Since reliability has been improving throughout the test (0 < b < 1), the MTBFI estimate at the end of test will be greater than the MTBFCUM estimate


at the end of test by the factor 1/(1 − β), corresponding to a parallel shift of the original Duane model straight line, extended forward to any desired end of test time. The graphical estimation approach plots the cumulative MTBF estimates at each repair time against the system age at that repair time, using log–log graph paper. A best-fitting line is drawn through these points, and the slope of the line is the β estimate and the intercept of the line (at t = 1 since ln(1) = 0) is the α estimate. It is common to use a least squares regression program to eliminate the subjectivity of hand drawing the line (see Chapter 6). If the system ages at repair times are t1, t2, … , tn, then the dependent variable is y1 = ln(t1), y2  = ln(t2/2), y3 = ln(t3/3), … , yn = ln(tn/n). The independent variable is x1 = ln(t1), x2 = ln(t2), … , xn = ln(tn). Any standard regression program will estimate α and β using the least squares estimation formulas given in Chapter 6. The following example illustrates the use of Duane plots for reliability growth estimation. EXAMPLE 13.7  DUANE RELIABILITY GROWTH ESTIMATION A multimillion dollar state-of-the-art semiconductor processing tool has an unacceptable reliability performance record and it typically needs repairs on a daily basis. Customer requirement is for an MTBF of at least 100 hours. The manufacturer decides to invest in a 12-week reliability improvement testing program, during which the tool will be exercised as much as possible. All failures are carefully traced to root causes, and design and component changes are introduced during the course of the testing. Twelve failures are recorded at the following system age hours: 37, 56, 176, 262, 516, 522, 544, 588, 719, 755, 861, and 1166. Assuming the test ends after the last repair, use the Duane plot technique to estimate the MTBF reliability growth model and to calculate the MTBF of the tool at the end of the test.

Solution The 12 MTBFcum(tk) estimates corresponding to the 12 failure times are 37, 28, 59, 66, 103, 87, 78, 74, 80, 76, 78, and 97. These are calculated by simply dividing the system age at the kth failure by k. For example, MTBFcum(t3) = 176/3, or approximately 59. When these estimates are plotted against the corresponding failure times on log–log paper, we obtain the Duane plot shown in Figure 13.20. The line through the points was fit using least squares. The dependent variable is a vector of the logarithms of the MTBFcum estimates (or 3.61, 3.33, 4.07, 4.18, 4.64, 4.47, 4.35, 4.3, 4.38, 4.32, 4.36, 4.58). The independent variable is a vector of the logarithms of the system ages at failure (or 3.61, 4.03, 5.17, 5.57, 6.25, 6.26, 6.30, 6.38, 6.58, 6.63, 6.76, 7.06). The least squares estimate of α is 2.32 and the estimate of β is 0.322. The predicted value of the MTBFI at the end of the test (or at 1166 hours) is (e^2.32 × 1166^0.322)/(1 − 0.322) = 146.2 hours. By comparison, at the start of the test (for which we use 1 hour because of the log–log scale), the MTBFI according to the model is e^2.32/0.678 = 15.0. So the Duane plot indicates a successful reliability improvement test, with the MTBF increasing by almost 10 times and exceeding user requirements at the end of the test. Figure 13.20 shows both the least squares line through the data points and the instantaneous MTBF line that predicts what the improved MTBF estimate would be if the test had been stopped at any particular point in time. The only "real" point on this line is the one at 1166 hours. However, it is customary to plot the entire instantaneous MTBF line on a Duane plot so that "what if" questions can be entertained about the effects of shorter or longer test times.
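The least squares fit in Example 13.7 can be reproduced with any regression routine; the sketch below (assuming NumPy; not the book's plotting worksheet) fits the Duane line and converts the result to the instantaneous MTBF at the end of test.

```python
# Duane reliability growth estimates for the Example 13.7 failure ages.
import numpy as np

ages = np.array([37, 56, 176, 262, 516, 522, 544, 588, 719, 755, 861, 1166.0])
k = np.arange(1, len(ages) + 1)
x = np.log(ages)                       # ln(system age at each failure)
y = np.log(ages / k)                   # ln(cumulative MTBF estimate)

beta, alpha = np.polyfit(x, y, 1)      # slope and intercept of the Duane line
T = ages[-1]                           # test ends at the last failure
mtbf_cum_end = np.exp(alpha) * T**beta
mtbf_inst_end = mtbf_cum_end / (1.0 - beta)
print(alpha, beta, mtbf_inst_end)      # roughly 2.32, 0.32, and 146 hours
```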

The larger β (the slope of the reliability growth line) is, the more rapidly the MTBF improves. O’Connor (1991) gives some guidelines for typical values of β. Generally, β will lie between 0.2 and 0.6, depending on how timely and effective the corrective actions are for eliminating failure modes found during the test. The Duane empirical procedure is a simple model for reliability growth that has been used successfully in the literature many times. However, it does have several disadvantages. There is no way to put valid confidence bounds around the graphical estimates, or


FIGURE 13.20 Duane plot of cumulative MTBF versus cumulative time with least squares line.

apart from the appearance of the fit of the line, to test whether the data are consistent with the Duane model. In addition, because the points are highly correlated and have differing statistical distributions, least squares estimates have no statistical optimality properties and are merely an objective way to fit a line through the points. Crow (1974) developed the power relationship stochastic model for repairable systems that provided a theoretical basis for the empirical Duane model. This work was done for the U.S. Army Materials Systems Analysis Activity, and it is often referred to as the AMSAA model. As described in Section 13.8.1, MLEs for the parameters a = e−α and b = 1 − β of the AMSAA power relationship model are given by

$$\hat{b} = \frac{n}{\sum_{i=1}^{n-1}\ln\!\left(\frac{t_n}{t_i}\right)}, \qquad \hat{a} = \frac{n}{t_n^{\hat{b}}}$$

for the case where n repair actions take place at times t1, t2, … , tn, and the test ends right after the nth failure. If the test continues beyond the time of the last failure to a total time of T, the MLEs are given by


$$\hat{b} = \frac{n}{\sum_{i=1}^{n}\ln\!\left(\frac{T}{t_i}\right)}, \qquad \hat{a} = \frac{n}{T^{\hat{b}}}$$

The MLE of the final MTBF for a test that lasts until the nth fail is

$$\widehat{\text{MTBF}}_I(t_n) = \frac{1}{\hat{a}\hat{b}\,t_n^{\hat{b}-1}} = \frac{t_n^{\hat{b}}}{n\hat{b}\,t_n^{\hat{b}-1}} = \frac{t_n}{n\hat{b}}$$

If the test is terminated at time T, where T is greater than the time of the last fail, replace tn by T in the last equation to obtain the MLE for a time-censored test:

$$\widehat{\text{MTBF}}_I(T) = \frac{T}{n\hat{b}}$$



The MLE for b has optimal statistical properties as the number of failures gets large. However, it has a built-in bias that causes it to overestimate b (and underestimate the reliability growth slope β). This bias can be corrected by multiplying b by a factor that depends on the number of failures. An unbiased estimate of b when the test is terminated at the nth failure is given by (n − 2)bˆ b= n If, instead, the test is terminated at a time T after the nth failure, an unbiased estimate of b is given by (n − 1)bˆ b= n For n below around 50, the correction factors above will certainly have a significant effect on the estimate of b and should be used. For large n, applying the correction factor will make little practical difference and is optional. Whenever the corrected estimate b is used, use a = n T b for the estimate of a for consistency (although, unfortunately, it has no guarantee of being an unbiased estimate). We will call b and a “modified MLEs” and use them to calculate b and a estimates, as well as the estimated MTBFI at the end of test. Lines plotted on a Duane plot based on these modified MLEs will be called “modified MLE lines.” Two important aspects of a reliability growth analysis remain: testing for the adequacy of the power relationship model and deriving an upper and a lower bound for the MTBFI at the end of the reliability improvement test. The CN2 test for the power relationship model (from Crow 1974) was described in Section 13.8.1. If CN2 is larger than an appropriate critical value from Table 13.4, use of the power relationship model for the failure data becomes questionable. Table 13.4 gives critical values at significance levels ranging from 20% (an hypothesis test at the 80% confidence level) to 1% (a 99% confidence level test). To calculate CN2 for a reliability improvement test that ends at the time of the nth failure, set N = n − 1 and use the equation

1 + C = 12 N 2 N

where b is the modified MLE of b.

N

∑ i=1

ti tn

b

2

2i − 1 − 2N


If the test ends at a time T, which is greater than the last failure time tn, then set N = n and use the equation

$$C_N^2 = \frac{1}{12N} + \sum_{i=1}^{N}\left[\left(\frac{t_i}{T}\right)^{\bar{b}} - \frac{2i-1}{2N}\right]^2$$

A confidence interval for the MTBFI at the end of the test is easily obtained using the factors in Tables 13.7 and 13.8 (reproduced with permission from Crow 1982). For a test that ends at the nth failure, look up the values of R1 and R2 from Table 13.7, which correspond to n and the confidence level desired. Multiply the (unmodified) MLE of MTBFI(tn) by R1 and R2 to obtain the lower and upper limits of the confidence interval. The interval is

$$\left(R_1 \times \widehat{\text{MTBF}}_I(t_n),\ R_2 \times \widehat{\text{MTBF}}_I(t_n)\right)$$

If a one-sided limit is desired, the lower or upper limit of a 90% confidence interval is a 95% lower or upper limit (and, in general, the lower or upper limit of a 100 × [1 − α] confidence interval is a 100 × [1 − α/2] lower or upper bound). For a time-censored test ending at T > tn (where tn is the time of the nth failure), use the values of P1 and P2 from Table 13.8, which correspond to n and the confidence level desired. Again, be sure to multiply the unmodified MLE of MTBFI (T) by these factors to obtain the lower and upper limits of the confidence interval. Tables 13.7 and 13.8 stop at n = 100. While it is unlikely that a real reliability growth test will have more than 100 failures, Crow (1982) gives formulas for approximate confidence bounds that are quite accurate for large n. For a test that ends at the nth failure, calculate an approximate 100 × (1 − α) percent confidence interval using



$$R_1 = \frac{1}{1 + Z_{\alpha/2}\sqrt{2/n}}, \qquad R_2 = \frac{1}{1 - Z_{\alpha/2}\sqrt{2/n}}$$

where Zα/2 is the (1 − α/2)th percentile of the standard normal distribution. For a test ending at time T > tn, calculate an approximate 100 × (1 − α) percent confidence interval using

$$P_1 = \left(\frac{n}{n + Z_{\alpha/2}\sqrt{n/2}}\right)^2, \qquad P_2 = \left(\frac{n}{n - Z_{\alpha/2}\sqrt{n/2}}\right)^2$$
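These large-sample factors are simple to evaluate. The following Python sketch (assuming SciPy; intended only as an illustration of the approximation, since for small n the tabled factors should be used) compares them with the n = 100 rows of Tables 13.7 and 13.8.

```python
# Crow's large-sample approximate confidence factors for the end-of-test MTBF.
from math import sqrt
from scipy.stats import norm

def bounds_failure_censored(n, conf=0.90):
    z = norm.ppf(1.0 - (1.0 - conf) / 2.0)        # z_{alpha/2}
    r1 = 1.0 / (1.0 + z * sqrt(2.0 / n))
    r2 = 1.0 / (1.0 - z * sqrt(2.0 / n))
    return r1, r2

def bounds_time_censored(n, conf=0.90):
    z = norm.ppf(1.0 - (1.0 - conf) / 2.0)
    p1 = (n / (n + z * sqrt(n / 2.0)))**2
    p2 = (n / (n - z * sqrt(n / 2.0)))**2
    return p1, p2

print(bounds_failure_censored(100))   # roughly (0.81, 1.30); Table 13.7 gives .8100, 1.293
print(bounds_time_censored(100))      # roughly (0.80, 1.28); Table 13.8 gives .791, 1.286
```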

EXAMPLE 13.8  CONFIDENCE BOUNDS AND MODIFIED MLES Suppose a test has 16 failures and ends at 1250 hours, which you are told is the time of the 16th failure. The MLE of b has been calculated and is b̂ = 0.72. Give an estimate and a 90% confidence interval for MTBFI(1250). How do the estimate and confidence interval change if you learn the test


TABLE 13.7
R1 and R2 Values to Multiply MTBF Estimate and Obtain Confidence Bounds (Test Ends at nth Fail)

                                 Confidence Level
Num Fails      80%                90%                95%                98%
n              R1      R2         R1      R2         R1      R2         R1      R2
2              .8065   33.76      .5552   72.67      .4099   151.5      .2944   389.9
3              .6840   8.927      .5137   14.24      .4054   21.96      .3119   37.60
4              .6601   5.328      .5174   7.651      .4225   10.65      .3368   15.96
5              .6568   4.000      .5290   5.424      .4415   7.147      .3603   9.995
6              .6600   3.321      .5421   4.339      .4595   5.521      .3815   7.388
7              .6656   2.910      .5548   3.702      .4760   4.595      .4003   5.963
8              .6720   2.634      .5668   3.284      .4910   4.002      .4173   5.074
9              .6787   2.436      .5780   2.989      .5046   3.589      .4327   4.469
10             .6852   2.287      .5883   2.770      .5171   3.286      .4467   4.032
11             .6915   2.170      .5979   2.600      .5285   3.054      .4595   3.702
12             .6975   2.076      .6067   2.464      .5391   2.870      .4712   3.443
13             .7033   1.998      .6150   2.353      .5488   2.721      .4821   3.235
14             .7087   1.933      .6227   2.260      .5579   2.597      .4923   3.064
15             .7139   1.877      .6299   2.182      .5664   2.493      .5017   2.921
16             .7188   1.829      .6367   2.144      .5743   2.404      .5106   2.800
17             .7234   1.788      .6431   2.056      .5818   2.327      .5189   2.695
18             .7278   1.751      .6491   2.004      .5888   2.259      .5267   2.604
19             .7320   1.718      .6547   1.959      .5954   2.200      .5341   2.524
20             .7360   1.688      .6601   1.918      .6016   2.147      .5411   2.453
21             .7398   1.662      .6652   1.881      .6076   2.099      .5478   2.390
22             .7434   1.638      .6701   1.848      .6132   2.056      .5541   2.333
23             .7469   1.616      .6747   1.818      .6186   2.017      .5601   2.281
24             .7502   1.596      .6791   1.790      .6237   1.982      .5659   2.235
25             .7534   1.578      .6833   1.765      .6286   1.949      .5714   2.192
26             .7565   1.561      .6873   1.742      .6333   1.919      .5766   2.153
27             .7594   1.545      .6912   1.720      .6378   1.892      .5817   2.116
28             .7622   1.530      .6949   1.700      .6421   1.866      .5865   2.083
29             .7649   1.516      .6985   1.682      .6462   1.842      .5912   2.052
30             .7676   1.504      .7019   1.664      .6502   1.820      .5957   2.023
35             .7794   1.450      .7173   1.592      .6681   1.729      .6158   1.905
40             .7894   1.410      .7303   1.538      .6832   1.660      .6328   1.816
45             .7981   1.378      .7415   1.495      .6962   1.606      .6476   1.747
50             .8057   1.352      .7513   1.460      .7076   1.562      .6605   1.692
60             .8184   1.312      .7678   1.407      .7267   1.496      .6823   1.607
70             .8288   1.282      .7811   1.367      .7423   1.447      .7000   1.546
80             .8375   1.259      .7922   1.337      .7553   1.409      .7148   1.499
100            .8514   1.225      .8100   1.293      .7759   1.355      .7384   1.431

507

Repairable Systems Part II: Nonrenewal Processes

TABLE 13.8 P1 and P2 Values to Multiply MTBF Estimate and Obtain Confidence Bounds (Test Ends at Time T) Confidence Level 80%

Num Fails n

90%

95%

98%

P1

P2

P1

P2

P1

P2

P1

2

.261

18.66

.200

38.66

.159

78.66

.124

3

.333

6.326

.263

9.736

.217

14.55

.174

24.10

4

.385

4.243

.312

5.947

.262

8.093

.215

11.81

5

.426

3.386

.352

4.517

.300

5.862

.250

8.043

6

.459

2.915

.385

3.764

.331

4.738

.280

6.254

7

.487

2.616

.412

3.298

.358

4.061

.305

5.216

8

.511

2.407

.436

2.981

.382

3.609

.328

4.539

9

.531

2.254

.457

2.750

.403

3.285

.349

4.064

10

.549

2.136

.476

2.575

.421

3.042

.367

3.712

11

.565

2.041

.492

2.436

.438

2.852

.384

3.441

12

.579

1.965

.507

2.324

.453

2.699

.399

3.226

13

.592

1.901

.521

2.232

.467

2.574

.413

3.050

14

.604

1.846

.533

2.153

.480

2.469

.426

2.904

15

.614

1.800

.545

2.087

.492

2.379

.438

2.781

16

.624

1.759

.556

2.029

.503

2.302

.449

2.675

17

.633

1.723

.565

1.978

.513

2.235

.460

2.584

18

.642

1.692

.575

1.933

.523

2.176

.470

2.503

19

.650

1.663

.583

1.893

.532

2.123

.479

2.432

20

.657

1.638

.591

1.858

.540

2.076

.488

2.369

21

.664

1.615

.599

1.825

.548

2.034

.496

2.313

22

.670

1.594

.606

1.796

.556

1.996

.504

2.261

23

.676

1.574

.613

1.769

.563

1.961

.511

2.215

24

.682

1.557

.619

1.745

.570

1.929

.518

2.173

25

.687

1.540

.625

1.722

.576

1.900

.526

2.134

26

.692

1.525

.631

1.701

.582

1.873

.531

2.098

27

.697

1.511

.636

1.682

.588

.537

2.068

28

.702

1.498

.641

1.664

.594

1.825

.543

2.035

29

.706

1.486

.646

1.647

.599

1.803

.549

2.006

30

.711

1.475

.651

1.631

.604

1.783

.554

1.980

35

.729

1.427

.672

1.565

.627

1.699

.579

1.870

40

.745

1.390

.690

1.515

.646

1.635

.599

1.788

45

.758

1.361

.705

1.476

.662

1.585

.617

1.723

50

.769

1.337

.718

1.443

.676

1.544

.632

1.671

60

.787

1.300

.739

1.393

.700

1.481

.657

1.591

70

.801

1.272

.756

1.356

.718

1.435

.678

1.533

80

.813

1.251

.769

1.328

.734

1.399

.695

1.488

100

.831

1.219

.791

1.286

.758

1.347

.722

1.423

.1848

P2 198.7

508

Applied Reliability

was planned to end at 1250 hours, and the 16th failure actually occurred before that time (assume the MLE bˆ   = 0.72 was calculated correctly for a time-censored test)?

Solution First, we calculate the MLE for the MTBFI at the end of test, that is, tn /(nbˆ ) = 1250/(16 × 0.72) = 108.5. To construct a 90% confidence interval for a test that ends upon reaching the 16th failure, we look up R1 = 0.6367 and R 2 = 2.144 from Table 13.7. The interval for MTBFI(1250) is 0.6367 × 108.5, 2.144 × 108.5) = (69.1, 232.6). Instead of using bˆ for our final estimate for b, we calculate the modified MLE b = (14/16) × 0.72 = 0.63

b = (14/16) × 0.72 = 0.63. Using this estimate of b, we recalculate the MTBFI(1250) estimate as 1250/(16 × 0.63) = 124. This value is our point estimate for the MTBF at the end of test, while the confidence interval remains as previously calculated. After we learn the test was really time censored (i.e., planned to stop at 1250 hours), we recalculate the confidence interval using P1 = .556 and P2 = 2.029 from Table 13.8. The 90% confidence interval for MTBFI(1250) is (0.556 × 108.5, 2.029 × 108.5) = (60.3, 220.2). Finally, we modify our MLE of b and the MTBF at the end of test using b   = (15 / 16) × 0.72 = 0.675 and MTBFI(1250) = 1250/(16 × 0.675) = 115.7. EXAMPLE 13.9  POWER RELATIONSHIP MODEL RELIABILITY GROWTH Use the MLE formulas for the power relationship model to calculate the Duane plot slope for the data in Example 13.7. Next, estimate the achieved MTBF at 1166 hours and give 90% confidence bounds for this MTBF. Is the power relationship model a reasonable assumption for these data?

Solution The MLEs are aˆ   = 0.030 and bˆ   = 0.848. It follows that the MLE for β is 1 − 0.848 = 0.152, and the MLE for α is −ln(0.030) = 3.5. The MTBFI at the end of the test is estimated to be 1166/ (12 × 0.848) = 114.6 hours. From Table 13.7, we look up factors that multiply the MTBFI MLE to obtain a lower 5% bound and an upper 95% bound. For 12 failures, these factors are R1 = 0.6067 and R 2 = 2.464. Applying these factors, the estimate of 114.6 generates a 90% confidence interval for the achieved MTBFI of (69.4, 281.7). The modified MLE for b is (10/12) × 0.848 = 0.706 and for β is 0.294. The modified MTBFI at the end of test estimate is 1166/(12 × 0.706)=137.6. Finally, we use the CN2 statistic defined earlier to test the hypothesis that the failure data are consistent with the power relationship model. The test statistic is C112   = 0.0860, and from Table 13.4, this is not significant even at an 80% confidence level. Therefore, it is reasonable to use a power relationship model for the reliability growth in this example.

Figure 13.21 shows the Duane plot with the modified maximum likelihood MTBFcum(t) line and the corresponding modified MTBFI(t) line (here, as before, “modified” means b and a were used instead of the MLEs bˆ and aˆ). Examples 13.7 and 13.9 show that the Duane least squares slope and the power relationship model MLE slope may differ considerably (0.322 vs. 0.152). However, the modified MLEs were in close agreement with the least squares estimates (0.294 vs. 0.322 for β and 137.6 vs. 146.2 for the MTBFI at the end of test). A complete reliability growth data analysis begins with a Duane plot and continues with MLE and modified MLEs and a test for the fit of the power relationship model. If the model is accepted, the final output of the analysis should be the modified MLE point estimate of the MTBFI at the end of the test, with upper and lower confidence bounds. EXERCISE 13.23 Repeat Example 13.8 assuming that the test ended at 1400 hours with no more failures observed after the failure at 1166 hours.

509

Repairable Systems Part II: Nonrenewal Processes

1000

Cumulative MTBF

100

10 MTBFcum Modified MLE line Instantaneous MTBF (modified MLE) MTBFI (modified)

1

1

10

100 Cumulative time t

1000

10000

FIGURE 13.21 Duane plot with modified MLE lines.

EXERCISE 13.24 A prototype model of a new piece of equipment is put through a reliability improvement test. Fails are observed at the following times: 18, 20, 35, 41, 68, 211, 252, 288, 380, 382, 408, 449, and 532 hours. The test is stopped at 650 hours. Draw a Duane plot of the data and calculate MLE and modified MLE for the improvement slope and MTBFI(650). Is the power relationship model a reasonable choice for the data? Give a 95% confidence interval for the MTBF at the end of the test. EXERCISE 13.25 A reliability improvement test records failures at 37, 42, 44, 46, 47, and 121 hours. The test ends at 320 hours with no more failures. Draw a Duane plot of the data and calculate MLE and modified MLE for the improvement slope and the MTBF at the end of the test. Is the power relationship model a reasonable choice for these data? EXERCISE 13.26 For Example 13.9, confirm the CN2 test statistic value of 0.860.

510

Applied Reliability

Since the Duane/AMSAA model appears in the above discussions with two different sets of parameters (a and b, or α and β) and other parameters also are used in the literature, further confusing things, two key formulas are summarized below using both sets of parameters: MTBFcum (t) = e αtβ = MTBFI (t) =

t 1−b , a

e αtβ t 1−b = 1 − β ab

The equation for MTBFI(t) is useful for planning how long to run a reliability improvement test. For example, by assuming a conservative growth rate slope of β = 0.3 and a current MTBF value (which we set as the MTBFI at 1 hour, for convenience), it is easy to solve for the test time needed to improve to a desired final MTBF. If previous experience indicates a more aggressive improvement slope can be achieved (i.e., β of 0.4 or 0.5 or even higher), then that can be used. Alternatively, previous tests on similar products may have demonstrated a given amount of improvement over the course of the test. Using the following equations, where IF is the improvement factor (the ratio of the target MTBFI at time T and the starting or reference MTBFI0 at time t0), we can easily go from β to an improvement factor or vice versa: IF =

MTBFI e αT β T = αβ = MTBFI0 e t0 t0

β

;

β = log( IF ) log(T t0 );



T = t0 IF 1/β The Duane model is an empirical model, with several key assumptions. For example, we assume that development testing is done on representative product and simulates actual field usage, that including corrective action is done on observed failures, and that β is constant allowing for extrapolation of constant failure rates. Reliability improvement may have considerable variation because of modifications that occur to products under development. Despite its simplicity, the model can be useful for reliability planning and monitoring. EXAMPLE 13.10  SOFTWARE RELIABILITY IMPROVEMENT In order to improve the quality of a new software product, the software is run through a battery of tests designed to uncover defects (or “bugs”). When a defect is detected, the software code is corrected to remove the bug. The manufacturer has been running the tests for 6 weeks. At the end of the first, second, fourth, and sixth weeks, the cumulative MTBF is estimated as shown in Table 13.9. Assuming improvement at the same rate, how long should the tests be run to achieve a target-objective MTBFI of 350 hours? How long should it take to reach a target MTBFI of 500 hours? TABLE 13.9 Results of Software Evaluation Testing Test Time (Hrs) 168 168 336 336

Cumulative Time (Hrs)

Number of Bugs

Cumulative Bugs

MTBF Cum (Hrs)

168 336 672 1008

3 2 2 1

3 5 7 8

56.0 67.2 96.0 126.0

511

Repairable Systems Part II: Nonrenewal Processes

Solution The Duane model plot, created using log–log axes, is shown in Figure 13.22. In Excel, by right clicking on the data points in the chart and selecting Add Trendline …, we obtain the dialog box shown as Figure 13.23. We select the Power of Trend/Regression Type and check the box to display the equation on the chart. We see that the regression estimates are e αˆ = 5.230 and βˆ = 0.452. The instantaneous MTBFI0 at the end of test t0 = 1008 hours is given by



MTBFI0 (t0 ) =

126.0 MTBFcum (1008) = = 230.0 1− β 1− 0.452

For a target-objective MTBFI of 350 hours, the improvement factor is 350/230 = 1.522. Hence, the test time required, assuming the same growth slope β, is



T = t0 IF 1/ β = 1008(1.522)1/ 0.252 = 2, 552 or an additional 1544 hours. To hit a target objective of MTBFI of 500 hours, the time required is



T = t0 IF 1/ β = 1008(500 230.0)1/ 0.252 = 5, 618 hours

EXERCISE 13.27 Based on past experience, a prototype of a new manufacturing tool is expected to have an MTBFI of 50 hours at the start (at 1 hour) of a reliability improvement test and an MTBFI of 500 hours at the end of a 1000-hour test. What is the reliability growth slope? If the customer requirement is an MTBF of 400 hours, how long should the test be run?

MTBF (hours)

1000

100

y = 5.2297x0.4521 MTBFcum MTBFI Regression line

10 100 FIGURE 13.22 Duane plot of software cumulative MTBF estimates.

1000 Age t (hours)

10000

512

Applied Reliability

FIGURE 13.23 Excel trendline dialog box.

EXERCISE 13.28 How much improvement would take place between the 1st and the 500th hour of testing when β is 0.5?

Much useful information concerning the practical and analytical aspects of reliability growth testing can be found in U.S. MIL-HDBK-189 Reliability Growth Management and U.S. MIL-STD-1635(EC) Reliability Growth Testing. Kececioglu (1991) has a detailed section on the Duane and AMSAA (power relationship) models that contains several worked out examples. Also, note that the estimation formulas in this section only apply to one system undergoing reliability improvement testing (and the MTBF projection applies to all later systems that incorporate the improvements made on the one system during the test). If several systems are on test (with possibly different system ages), the estimation formulas are more complicated (see Section 13.8 and Crow 1974, 1993).

13.11  Simulation of Stochastic Processes We have mentioned in several previous chapters that simulation is a powerful tool for generating and understanding random events. The simulated data allow us to investigate the validity and range of applicability of analysis procedures. In situations where theory is inadequate or nonexistent or the models are too complicated for solution, simulation studies may be the only viable approach. Additionally, simulation may be the source of data to illustrate

513

Repairable Systems Part II: Nonrenewal Processes

important concepts. In this section, we discuss simulation techniques for renewable and repairable processes. In particular, we show how to simulate both the HPP and the NHPP. We provide an example of the simulation of the NHPP under the power law relation and thereby show the application of analysis methods described in Sections 13.3 through 13.5. Earlier we stated that a stochastic process is a collection of random variables describing the evolution through time of a process. To simulate a stochastic process, we thus simulate a sequence of random variables. For example, consider a time-censored renewal process in which a random number of N repairs occur by system age t. The interarrival times have a specific distribution F. We simulate independent random variables X 1 , X 2 , ... with distribution F and stop at the minimum N + 1 when the sum X 1 + X 2 +  + X N + X N +1 > t . Thus, the X i , i ≥ 1, are the simulated interarrival times, and the simulation yields N repairs occurring at system ages X 1 , X 1 + X 2 , ... , X 1 +  + X N . If we want to simulate an HPP having rate λ, then the distribution F for simulating the X i , i ≥ 1 is exponential with failure rate λ. Alternative methods for simulating a Poisson process are described in Ross (1993). To simulate an NHPP having intensity λ(t) for 0 ≤ t < ∞, we use an approach presented in Ross (2006). We will simulate the successive repair times. Let Y1 , Y2 , ... denote the system ages at repair. Since these random variables are dependent, we need to find the conditional distribution of Yi given Y1 , ... , Yi−1. Ross shows that if a repair occurs at time y, then, independent of what occurs prior to y, the time t until the next event at time y + t has the distribution FY given by FY (t) = 1 − exp −



t

∫ λ(y + τ)dτ 0

Thus, we simulate the first repair time Y1 from F0 (using y = 0 in the integral). If the simulated value of Y1 is y1, we simulate Y2 by adding y1 to a value generated from FY1 (using y = y1 in the integral) and calling the sum y 2 . Similarly, Y3 is simulated by adding y 2 to a value generated from FY2 with sum y 3, and so on. If the distribution FY can be inverted, the inverse transform method described in Section 1.9 can be used. Let us now illustrate the approach for the case where the intensity function is described by the power relation model. EXAMPLE 13.11  SIMULATING AN NHPP WITH POWER RELATION INTENSITY The intensity is given by the model



λ(y ) = aby b−1 Then, t

t

∫ λ(y + τ)dτ = ∫ ab(y + τ)



b −1

0

dτ = a (y + t )b − y b

0

So,



{

FY (t ) = 1 − exp − a (y + t )b − y b

}

The distribution FY can be inverted, and after some algebra to solve for t, we get



t = yb −

1 ln(1 − FY ) a

1/ b

−y

514

Applied Reliability

By the inverse transform method, we substitute the random unit uniform variable 1 − U for FY to get



FY−1(u ) = y b −

1 ln(u ) a

1/ b

−y

We can thus simulate the successive repair ages Y1,Y2 , ... by generating U1, U2 , ... and applying the formulas (where Y0 = 0) 1 Y1 = − lnU1 a



1/ b

,

1 Y2 = Y1b − lnU2 a

1/ b

1 − Y1 + Y1 = Y1b − lnU2 a

1/ b

and in general,



Yi = Yi b−1 −

1 ln Ui a

1/ b

The spreadsheet function RAND() can be used to generate the Ui . EXAMPLE 13.12  SIMULATING THE FIRST SIX REPAIR TIMES FOR NHPP WITH SPECIFIED POWER RELATION MODEL Let the parameters of the intensity function for the power relation model be a = 0.25 and b = 0.5, that is, the intensity function is



λ(t ) = (0.25)(0.5)t (0.5−1) = 0.125t −0.5 We will simulate the system ages for the first six repairs. Using a unit uniform pseudorandom number simulator, similar to what is available in many handheld calculators, we generate the numbers 0.2727, 0.8305, 0.8772, 0.3344, 0.1143, and 0.7004. The first simulated repair time is



y1 = y00.5 −

1 ln(0.2727) 0.25

1/ 0.5

= 27.0

where y0 = 0. The second simulated time is



y 2 = ( 27.0)0.5 −

1 ln(0.8305) 0.25

1/ 0.5

= 35.3

Continuing in this manner, we get the remaining repair ages 41.8, 117.6, 380.9, and 438.5. EXERCISE 13.29 Simulate the first 10 repair times for a system modeled by the NHPP having a power relation intensity with parameters a = 0.10 and b = 0.75. Graph the results and test the null hypothesis of an HPP. EXERCISE 13.30 Simulate the first 10 repair times for a system modeled by the NHPP having an exponential relation intensity with parameters a = 0.001 and b = 0.005. Graph the results and test the null hypothesis of an HPP.

Repairable Systems Part II: Nonrenewal Processes

515

13.12  Summary In this chapter, we have presented some keys concepts on repairable systems. While far from an exhaustive treatment, these ideas illustrate how different approaches are useful for handling data from repairable systems. Graphical tools are very important to precede analytical procedures. Methods of modeling and tests of trend for repairable system data have been discussed. We have emphasized the distinction between analysis of repairable and nonrepairable systems. Different analysis situations for renewal and nonrenewal data have been illustrated. Simulation procedures for renewal processes and the NHPP have been described. We see that the subject of repairable system reliability has many important and practical applications. Reliability growth modeling using the Duane–Crow–AMSAA model has been used in industry with a history of successful applications. The model applies to testing situations where failures are rapidly traced to root causes and improvements that change system design and subcomponent choices are immediately implemented. The MTBF for the system improves proportional to T β, where T is the length of the reliability improvement test and β is the improvement slope. An alternate way of describing the model sets M(t) (the mean number of failures up to time t) equal to atb (where b = 1 − β). This is equivalent to the power relationship NHPP model discussed in this chapter, and MLEs and confidence bounds can be easily be calculated. An estimate of how long to test in order to reach a desired MTBF level can be calculated after estimating (or assuming values for) the current MTBF and the reliability growth improvement slope.

Problems 13.1 The number of repairs for newly established, continuously operating measuring stations follows an NHPP with intensity linearly decreasing from 1 repair per day initially to 0.1 per day by the end of the 10th week. The rate stays at the constant level of 0.1 repairs per day thereafter. We assume that the number of repairs are independent, day to day. What is the probability that no repairs occur during days 15 through 21? What is the expected number of repairs after 20 weeks? 13.2 The number of accidents at an intersection follows an NHPP having an initial rate of 0.0001 per hour at 6:00 am and increasing linearly to 0.005 per hour by 10:00 am. The rate remains constant at 0.0005 per hour until 7:00 pm. From 7:00 pm, it decreases linearly from 0.0005 per hour to 0.0001 per hour by 10:00 pm, remaining constant until 6:00 am. Assume that the number of accidents occurring during disjoint time periods is independent. What is the probability of no accidents happening during the hours of 12 noon to 2:00 pm? What is the expected number of accidents during this period? 13.3 The repair histories (in hours) on four systems assumed to be independent are as follows: System 1: 142, 309, 460 System 2: 99, 145, 300, 347 System 3: 212, 225, 273, 398, 467 System 4: 21, 58, 150, 176

516

Applied Reliability

Is there any evidence against the null hypothesis of an HPP for each system? Is there any evidence based on the overall test? 13.4 Using the 10 repair ages of Example 13.7, estimate the modified MLE parameters of the power relation model. Plot the original and the estimated intensity functions versus the system ages. Estimate the mean cumulative function for the power law model using the modified MLE parameter values. Compare to the actual data using a cumulative plot. Calculate the goodness-of-fit statistic CQ2 and compare it to the critical values in Table 13.4. 13.5 Describe how you would plan a reliability growth test for a new tool to achieve an MTBF objective of 200 hour at 90% confidence. Assume that the tool will begin the test with an MTBF (at 1 hours) of 80, and past experience indicates a reliability growth slope of 0.3 can be expected. Note that the straightforward approach of using the test time where the MTBFI first reaches 200 hours is not likely to result in actual data that will confirm the objective MTBF as a 90% lower confidence limit. (Hint: Make use of the simulation techniques described in this chapter for the power relationship model and the confidence bound factors in Table 13.7.)

14 Bayesian Reliability Evaluation All the methods and types of analyses described in this text up to now fall in the realm of “classical statistical analysis.” There is another set of tools and techniques that can be used to extract answers from sample data known as “Bayesian methodology.” Bayesian methodology can often yield more information from smaller sample sizes than classical analysis. There is a price, of course. You need to make more assumptions, and the results can be controversial to the point of being not acceptable to some analysts. Many applications of Bayesian methodology encounter a highly polarized reception consisting of strong supporters and equally strong detractors. To the believer, a Bayesian approach offers an intuitively pleasing way to harness past experience and “expert judgment” toward a goal of reducing costs and test time, while still coming up with accurate estimates and sound decisions. This approach is especially attractive when estimating the failure rate of highly reliable components, where we have seen that sample sizes and test times can be very large if we want precise results. On the other hand, those against this approach feel that the price you pay for what almost seems like “something for nothing” is the loss of credibility of the final results. This chapter will first look at the differences between classical and Bayesian analysis, with an emphasis on explaining the different assumptions, benefits, and risks behind each approach. Then, we will show how Bayesian methods can be applied to solve some common reliability problems. The applications described in detail will be carried out using spreadsheet software. There are other powerful Bayesian techniques that are either much more advanced than the material covered here or require special software. These will be briefly described in Section 14.7 at the end of the chapter (with references for readers who want to explore these Bayesian applications further).

14.1  Classical versus Bayesian Analysis It makes a great deal of practical sense to use all the information available, old or new, objective or subjective, when making decisions under uncertainty. This view is especially true when the consequences of the decisions can have a significant impact—financial or otherwise. All of us make every day personal decisions this way, using an intuitive process based on our experience and including subjective judgments. Mainstream statistical analysis, however, seeks objectivity by generally restricting the information used in an analysis to that obtained from a current set of clearly relevant data. Prior knowledge is not used except to suggest the choice of a particular population model to “fit” to the data, and this choice is later checked against the data for reasonableness. Lifetime or repair models, as we saw when we looked at repairable and nonrepairable reliability population models, have one or more unknown parameters. The classical statistical approach considers these parameters as fixed but unknown constants to be estimated using sample data taken randomly from the population of interest. 517

518

Applied Reliability

A confidence interval for an unknown parameter is really a frequency statement about the likelihood that an interval, calculated after observing one of the conceivably very many possible samples we could take from the population, captures the true parameter value. Strictly speaking, one cannot make probability statements about the true parameter since, while unknown, it is fixed and not random. The particular interval calculated either does or does not contain the fixed parameter. It is the interval itself that is random, not the parameter value. The Bayesian approach, on the other hand, treats these population model parameters as random, not fixed, quantities. Before looking at the current data, we use old information, or even subjective judgments, to construct a prior distribution model for these parameters. This model expresses our starting assessment about how likely various values of the unknown parameters are. We then make use of the current data (using Bayes’ formula) to revise this starting assessment, deriving what is called the posterior distribution model for the population model parameters. Parameter estimates, along with confidence intervals (known as credibility intervals in this case), are calculated directly from the posterior distribution. Credibility intervals are legitimate probability statements about the unknown parameters since these parameters are now considered random quantities, not fixed values. It is unlikely in most applications that data will ever exist to validate a chosen prior distribution model. Parametric Bayesian prior models are chosen most often because of their flexibility and mathematical convenience, rather than their fit to past data. In particular, conjugate priors (defined in the next section) are a natural and popular choice of Bayesian prior distribution models. 14.1.1 Bayes’ Formula, Prior and Posterior Distribution Models, and Conjugate Priors Bayes’ formula is a useful equation from probability theory that expresses the conditional probability of an event A occurring, given that the event B has occurred (written P(A|B)), in terms of unconditional probabilities and the probability the event B has occurred, given that A has occurred. In other words, Bayes’ formula inverts which of the events is the conditioning event. The key formula (see Section 1.5) is

P( A|B) =

P( AB) P(A) × P(B|A) = P(B) P(B)

and P(B) in the denominator is further expanded by using the so-called law of total probability (Section 1.5) to write n



P(B) =

∑ P(B|A )P(A ) i

i

i=1

with the events Ai being mutually exclusive and exhausting all possibilities and including the event A as one of the Ai. The same formula, written in terms of probability density function models, takes the form

g(λ|x) =

f (x|λ)g(λ) ∞

∫ f (x|λ)g(λ)dλ 0

519

Bayesian Reliability Evaluation

where f ( x|λ) is the probability model, or likelihood function, for the observed data x, given the unknown parameter (or parameters) λ, g(λ) is the prior distribution model for λ and g(λ|x) is the posterior distribution model for λ given that the data x have been observed. The integral in the denominator has all the prior parameters integrated out and is known as the marginal distribution of f(x). Note that x typically represents a vector sample of observations, which may be a complete sample x1, x2, … , xn or a censored sample. This key formula for applying Bayesian methodology is sometimes written as

g(λ|x) ∞LIK(x|λ)g(λ)

where ° means the posterior distribution is proportional to the likelihood function for the observed data, multiplied by the prior density. The prior distribution model g(λ) will itself contain one or more parameters that have to be assigned values in order to calculate the posterior distribution density g(λ|x) using the formula shown above. Sometimes, there are additional parameters in the prior we are not interested in, which are called nuisance parameters. When the posterior distribution g(λ|x) is in the same family as the prior distribution g(λ) but with different parameter values, g(λ) and g(λ|x) are said to be conjugate distributions. Also, g(λ) is called the conjugate prior for the probability model f (x|λ). For example, the beta distribution model is a conjugate prior for the proportion of successes p when samples have a binomial distribution, and the gamma model is a conjugate prior for the failure rate λ when sampling failure times or repair times from an exponentially distributed population. This latter conjugate pair (gamma, exponential) is used extensively in Bayesian system reliability applications. We will make use of this conjugate pair when we look at the Bayesian approach for analyzing exponential failure data in the next section. 14.1.2  Bayes’ Approach for Analysis of Exponential Lifetimes Assume we have a population of components with lifetimes that are described by an exponential distribution with parameter λ. We have shown, in Chapter 3, how to estimate λ from a sample of failure and survival data and have shown how to calculate an interval that we believe will contain the true (but unknown) λ with a given degree of confidence. It is worthwhile repeating the explanation given in Chapter 3 for the meaning of this confidence interval for λ to help us understand the fundamental differences between classical and Bayesian approaches to reliability. In Chapter 3, we stated “A 90% confidence interval means that if the same experiment were repeated many times, and the same method were used over and over again to construct an interval for λ, 90% of those intervals would contain the true λ. For the one time we actually do the experiment, our interval either does or does not contain λ.” This view is the “classical” frequency notion of probability. The probability of an event is the success ratio of that event in a repeated series of trials as the number of repetitions approaches infinity. We can conceive of repeating a life test experiment over and over again, and we can define an event A = {the interval calculated at the end of the experiment contains the true value of the population failure rate λ}. However, within this classical framework, we cannot even ask questions like “What is the probability that the true λ lies within a particular calculated interval?” In the classical approach, λ is a fixed parameter

520

Applied Reliability

that either does or does not lie within the random interval. There is no repeated sequence of trials or a success ratio to consider. There are only observable events that have a probability we can calculate, assuming we know the value of λ. The Bayesian approach uses a different concept of probability, known as “subjective probability.” This view is related more to our “degree of belief” in something, rather than a frequency approach. An axiomatic theory can be developed for this concept (Savage 1954; Lindley 1965). For our purposes, the Bayesian approach amounts to assuming that the unknown parameters that specify a life distribution (such as λ, T50, σ, α, or β) are themselves random variables, chosen from a “prior” distribution that expresses our previous knowledge or intuition about these parameters. If we know the prior distribution, we can calculate the probability that λ lies within any particular interval. Within the Bayesian framework, the exponential failure rate parameter λ is a random variable assumed to have a CDF G(λ) and PDF g(λ). G(λ) is the prior distribution for λ and is a mathematical summary of all our previous knowledge or intuition about λ. After running an experiment and obtaining new data, this prior distribution is revised, using Bayes’ equation, to obtain a new updated distribution for λ. This new distribution is the posterior distribution for λ. The mean of the posterior distribution, or expected value of λ after observing t, is known as the Bayes point estimate for λ. This expectation, E(λ|t), is given by ∞



E(λ|t) =

∫ λg(λ|t)d λ 0

An upper (1 − α) bound for λ is obtained by solving for λ1−α in the equation ∞



α=

∫ g(λ|t)d λ

λ 1− α

If we let α = 0.5 and solve for λ50, this yields another estimate for λ (the median of the posterior distribution), which is sometimes used instead of E(λ|t). Once we have calculated the Bayesian estimate of λ1−α from the posterior distribution, we also have the Bayesian estimate for the lower bound of the mean time between failures (MTBF), since MTBFα = 1/λ1−α. The above formulas show how to estimate λ and calculate percentiles for g(λ|t) for any choice of a prior distribution. In particular, when f(t|λ) is an exponential distribution, it is common to choose a gamma (conjugate prior) model for g(λ). The formula for the gamma PDF, using parameters a and b, is

g(λ) =

b a λ a−1 e− bλ ; a > 0, b > 0 Γ(a)

Γ is the gamma function defined in Section 4.2, and Γ(a) = (a − 1)!, provided a is a positive integer. The mean of this prior distribution, E(λ), is just a/b, and the variance

521

Bayesian Reliability Evaluation

is σ² = a/b². The gamma distribution was described in Section 10.4, where r was used for the parameter now called a and λ was used as the second parameter instead of b. Be careful to distinguish the use of λ in this chapter as a random variable, which has a gamma distribution, and the use of λ in Chapter 10 as one of the gamma distribution parameters. Note that a common literature parameterization for the gamma would use a as defined here, calling it a “shape parameter,” but using the Greek letter σ (or sometimes γ, k, or r) for this parameter. It is also common to use β = 1/b, instead of b, as the “scale parameter.” We use b in this chapter because it will turn out to naturally correspond to test time. Without any testing, our initial estimate of λ is just E(λ) or a/b. If our testing produces r failures in T total unit hours of test time, then the posterior distribution, which is also a gamma distribution, can be shown to have new parameters: a = a + r ; b = b + T . The new failure rate estimate, using both the prior knowledge and the test results, is given next. The Bayes point estimator of λ is just (a + r)/(b + T), and the MTBF (or mean time to failure [MTTF]) estimate is

MTBF = MTBF = b /a = (b + T )/(a + r )

Next we calculate an upper bound for λ using Excel’s gamma distribution–related functions. These are GAMMADIST(y,α,β, TRUE) for the CDF at y and GAMMAINV(p,α,β,) for the inverse CDF function or quantile function. Excel uses parameters α = a and β = 1/b; so an upper 95% bound for λ is given by

λ 95 = GAMMAINV(.95, a , 1/b )

and a lower 95% bound for the MTTF or MTBF is the reciprocal of the upper 95% bound for the failure rate or 1/λ 95. Some useful relations between the gamma and chi-square distributions are shown in the Appendix 14.1A. Note that both Minitab and JMP have gamma CDF and CDF inverse functions similar to that of Excel, with the same parameterization as Excel. EXAMPLE 14.1  LOWER MTBF BOUNDS USING A BAYESIAN GAMMA PRIOR A gamma prior with a = 1 and b = 1000 was chosen for a system with assumed exponential interarrival times for repairs (a homogeneous Poisson process [HPP] system) having MTBF = b/a = 1000. A new test was run for 1200 hours when a repair action was needed. Find a point estimate and a 90% lower bound for the system MTBF.

Solution We have aʹ   = 1 + 1 = 2 and b ʹ = 1000 + 1200 = 2200. The MTBF point estimate is 2200/2 = 1100 hours, and a 90% lower bound is obtained using the Excel expression 1/GAMMAINV(.90,2,1/2200). This produces a lower bound of 566 hours. It is instructive to look at a plot of the gamma prior and posterior specified in the solution to Example 14.1. These PDFs are shown in Figure 14.1. In Excel, the prior is created by plotting GAMMAINV(x,1,1/1000,0) versus x, and the posterior by plotting GAMMAINV(x,2,1/2200,0) ­versus x, where x is suitably chosen plotting values on the horizontal axis. Note how the new test data modify the prior so that much of the weight is now concentrated in the region close to the posterior λ mean of 1/2200 = 0.00045.

522

Applied Reliability

Gamma density function f(x)

1200 1000

PDF

800

Prior Posterior

600 400 200 0

0

0.001

0.002

0.003

0.004 x

0.005

0.006

0.007

0.008

FIGURE 14.1 Bayesian gamma prior and posterior from Example 14.1.

14.2  Classical versus Bayes System Reliability We compare the classical paradigm versus the Bayesian paradigm when system reliability follows the HPP or exponential model (i.e., the flat portion of the bathtub curve). 14.2.1  Classical Paradigm for HPP System Reliability Evaluation

1. The MTBF is one fixed unknown value; there is no “probability” associated with it. 2. Failure data from a test or observation period allow you to make inferences about the value of the true unknown MTBF. 3. No other data are used and no “judgment”—the procedure is objective and based solely on the test data and the assumed HPP model.

14.2.2  Bayesian Paradigm for HPP System Reliability Evaluation





1. The MTBF is a random quantity with a probability distribution. 2. The particular piece of equipment or system you are testing “chooses” an MTBF from this distribution, and you observe failure data that follow an HPP model with that MTBF. 3. Prior to running the test, you already have some idea of what the MTBF probability distribution looks like based on prior test data or a consensus judgment of knowledgeable engineers. 4. A typical choice g(λ) for a prior model for λ = 1/MTBF is the gamma distribution (the conjugate prior for the exponential interarrival failure times of an HPP). Using this prior ensures that the posterior distribution g(λ|x) is also gamma.

14.2.3  Advantages and Disadvantages of Using Bayes’ Methodology While the primary motivation to use Bayesian reliability methods is typically a desire to save on test time and materials cost, there are other factors that should also be taken into account. Table 14.1 summarizes some of these “good news” and “bad news” considerations.

523

Bayesian Reliability Evaluation

TABLE 14.1 Bayesian Paradigm: Advantages and Disadvantages Pro’s 1. Uses prior information—this “makes sense.” 2. If the prior information is encouraging, less new testing may be needed to confirm a desired MTBF. 3. Confidence intervals are really intervals for the (random) MTBF and are called “credibility intervals.”

Con’s 1. Prior information may not be accurate— generating misleading conclusions. 2. The way of inputting prior information (choice of prior) may not be correct. 3. Customers may not accept the validity of any prior data or engineering judgments. 4. There is usually no one “correct way” of inputting prior information and different approaches can give different results.

14.3  Bayesian System MTBF Evaluations We assume the following:





1. Interarrival failure times for the system under investigation can be adequately modeled by the exponential distribution. For repairable systems, this means the HPP model applies and the system is operating in the flat portion of the bathtub curve. 2. The MTBF for the system can be regarded as chosen from a prior distribution model that is an analytic representation of our previous information or judgments about the system’s reliability. The form of this prior model is the gamma distribution (the conjugate prior for the exponential model). The prior model is actually defined for λ = 1/MTBF since it is easier to do the calculations this way. 3. Our prior knowledge is used to choose the gamma parameters a and b for the prior distribution model for λ.

How can we make use of prior knowledge to choose gamma parameters? There are many possible ways to convert “knowledge” to gamma parameters, depending on the form of the knowledge. We will describe two approaches.





1. If you have actual data from previous testing done on the system (or a system believed to have the same reliability as the one under investigation), this is the most credible prior knowledge and the easiest to use. Simply set the gamma parameter a equal to the total number of failures from all the previous data, and set the parameter b equal to the total of all the previous test hours. 2. A consensus method for determining a and b that works well is the following: Assemble a group of engineers who are knowledgeable about the system and its subcomponents as far as reliability is concerned. a. Have the group reach an agreement on a reasonable MTBF they expect the system to have. Each person could pick a number that they would be willing to bet even money that the system would either meet or miss, and the average or median of these numbers would be their 50% best guess for the MTBF. Or, they could just discuss even-money MTBF candidates until a consensus is reached.

524





Applied Reliability

b. Repeat the process again, this time reaching agreement on a low MTBF they expect the system to exceed. A “5%” value that they are “95% confident” the system will exceed (i.e., they would give 19 to 1 odds) is a good choice. Or, a “10%” value might be chosen (i.e., they would give 9 to 1 odds the actual MTBF exceeds the low MTBF). Use whichever percentile choice the group prefers. c. Denote the reasonable MTBF by MTBF50 and the low MTBF you are 95% confident the system will exceed by MTBF05. These two numbers uniquely determine gamma parameters a and b that have λ percentile values at the right locations (λ50 = 1/MTBF50 and λ 95 = 1/MTBF05).

We call the second method of specifying gamma prior parameters the 50/95 method (or the 50/90 method if we use MTBF10, etc.). A simple way to calculate a and b for this method using Excel will be described in Section 14.3.1. Many variations are possible based on the above two methods. For example, we might have prior data from sources that we do not completely trust. Perhaps we question whether the data really apply to the system under investigation. We might decide to “weight” the prior data by 0.5, to “weaken” it. This can be implemented by setting a = 0.5 multiplied by the number of fails in the prior data and b = 0.5 times the number of test hours. That approach spreads out the prior distribution and permits greater variation based on the information contained in new test data. Another option might be to calculate a and b using the 50/95 method and calculate new values using (possibly weighted) previous test data. Add the two a’s to get a final a and the two b’s to get a final b. 14.3.1  Calculating Prior Parameters Using the 50/95 Method We now describe how to obtain the prior gamma parameters a and b for the 50/95 method and indicate the minor changes needed when any two other MTBF percentiles are used. The step-by-step procedure is

1. Calculate the ratio RT = MTBF50 /MTBF05. 2. Open an Excel spreadsheet and put any starting value guess for a in A1, say 2. 3. Move to B1 and type the following expression:



GAMMAINV(.95, A1, 1) / GAMMAINV(.5, A1, 1)





4. Press enter and a number will appear in B1. We will use the Goal Seek tool in Excel to vary A1 until the number in B1 equals RT. 5. In early versions of Excel, click on Tools (on the top menu bar) and then on Goal Seek. Starting with Excel 2007, Goal Seek is found in the Data tab, under What-If Analysis as shown in Figure 14.2. 6. When Goal Seek is selected, a box will open. Click on Set Cell and highlight cell B1. $B$1 will appear in the Set Cell window. Click on To Value and type in the numerical value for RT. Click on By Changing Cell and highlight A1 ($A$1 will appear in By Changing Cell). Now click OK and watch the value of the parameter a appear in A1. 7. Go to C1 and type



= .5 * (MTBF50 ) * GAMMAINV(.5, A1, 2)







and the value of b will appear in C1 when enter is hit.

525

Bayesian Reliability Evaluation

EXAMPLE 14.2  CALCULATING PRIOR PARAMETERS USING THE 50/95 METHOD A group of engineers investigating the reliability of a new piece of equipment decide to use the 50/95 method to convert their knowledge into a Bayesian gamma prior. Consensus is reached on a likely MTBF50 value of 600 hours and a low MTBF05 value of 250. Find the prior parameters a and b.

Solution RT is 600/250 = 2.4. Figure 14.3 shows the Excel spreadsheet just prior to clicking OK in the Goal Seek box. After clicking OK, the value in A1 changes from 2 to 2.862977. This new value is the prior “a” parameter a. (Note: If the group felt 250 was an MTBF10 value, instead of an MTBF05 value, then the only change needed would be to replace 0.95 in the B1 equation by 0.90. This would be the 50/90 method.) Figure 14.4 shows what to enter in C1 to obtain the prior parameter b value of 1522.46. The gamma prior with parameters a = 2.863 and b = 1522.46 will have (approximately) a probability of 50% of being below 1/600 = .001667 and a probability of 95% of λ being below 1/250 = .004. This result can be checked by typing

FIGURE 14.2 Calling up Goal Seek.

FIGURE 14.3 Using Goal Seek to find the gamma prior a parameter a.

526

Applied Reliability

FIGURE 14.4 Calculating the gamma prior b parameter b.



= GAMMDIST(.001667, 2.863, (1/ 1522.46), TRUE) and



= GAMMDIST(.004, 2.863,(1/ 1522.46), TRUE) as described when gamma Excel functions were introduced in Section 10.4.

14.3.2  Calculating the Test Time Needed to Confirm an MTBF Objective Once we have determined gamma prior parameters a and b, we can go on to calculate how long we need to run a test. Assume we have a given MTBF objective, say M, and a desired confidence level, say 100 × (1 − α). We want to confirm that the system will have an MTBF of at least M at the 100 × (1 − α) confidence level. As in Section 3.12 on classical exponential test plans, we start by picking a number of failures, r, that can occur during the test and still allow us to confirm our objective. We need a test time T such that we can observe up to r failures and still “pass” the test. If this test time turns out to be too long (or too short), we can iterate with a different choice of r. When the test ends, the posterior gamma distribution will have (worst case, assuming exactly r failures) new parameters of a’ = a + r, b’ = b + T and passing the test means that the failure rate or repair rate λ1−α (the upper 100 × (1 − α) percentile for the posterior gamma) has to equal the target failure rate 1/M. This percentile is, by definition, G−1(1 − α; a′, b′), with G−1 denoting the inverse of the gamma distribution CDF with parameters a′, b′. We have to find the value of T that satisfies G−1(1 − α; a′, b′) = 1/M. Based on the properties of the gamma distribution, it turns out we can calculate T directly by the equation T = 0.5M × G−1(1 − α, a′,0.5) − b or the equivalent Excel equation = 0.5*M*GAMMAINV(1 − α,a + r,2) − b. EXAMPLE  14.3  CALCULATING A BAYESIAN TEST TIME Continuing with the equipment reliability investigation described in Example 14.2, we might want to determine a test time that can confirm an MTBF of 400 with 90% confidence. We decide to allow up to 1 failure (r = 1). The time needed can be calculated by the Excel formula



= .5*400 *GAMMAINV(.90,( 2.863 + 1), 2) – 1522.46

527

Bayesian Reliability Evaluation

and this results in a test time of 1077 hours. As a check, we see that if we do get exactly one failure, the posterior CDF evaluated by = GAMMADIST((1/400),3.863,1/(1522+1077),1) is 0.90, and this confirms that 400 hours is a 90% lower bound for the MTBF. We can compare this test time to the classical test time derived by using Table 12.8. The factor for 90% and r = 1 is 2.99. We find the classical test time needed by multiplying this factor by the MTBF objective and get 400 × 2.99 = 1196 hours. The Bayesian test time saves 119 hours of testing in this example.

EXAMPLE 14.4  A MINIMUM BAYESIAN TESTING TIME CALCULATION We are planning a qualification test for a complex system that is required to meet an MTBF specification of 1000 hours. Assuming the failure distribution is exponential, we would have to test for a minimum of 2303 hours, with no failures, in order to be 90% confident that the system will have the specified reliability (using the classical minimum test time calculation based on Table 12.8 for r = 0 and 90%). However, the vendor has supplied credible data indicating that a virtually identical system has been tested 1000 hours with only one failure. What is the minimum Bayesian test time needed in order to be 90% confident that the system MTBF is at least 1000 hours, assuming we have no failures?

Solution We start by using the vendor testing data to set prior parameters a = 1 and b = 1000. After a test of T hours with no failures, the parameters of the gamma posterior distribution will be a′ = a + 0 = 1 and b′ = b + T = 1000 + T. We want a value of T such that the 90th percentile of the posterior distribution equals the failure rate objective. With that T and no failures, we will be 90% confident that the system meets its objective. The Excel equation for T is = 0.5*M*GAMMAINV((1 − α), (a + r), 2) − b or = 0.5*1000*GAMMAINV(0.9, 1, 2) − 1000 and the result is T = 1303 hours. Note that in this example, the use of a Bayes prior allows us to reduce the minimum required test time from around 2300 hours to about 1300 hours. The Bayes test time is more than 43% less than the time required by classical methods. All we did to achieve this significant test time reduction was make use of credible vendor data. Does a Bayesian approach always save test time? No—it all depends on the assumed parameters for the gamma prior (see Exercise 14.1). However, whenever the prior parameter a is 1 or smaller, we will always end up with a shorter test time using the Bayesian approach than using the classical approach. In particular, when the prior a is exactly 1 (as in Example 14.4), we have the interesting result that the Bayesian test time will equal the classical test time minus the prior parameter b (see Exercise 14.2). That means we could have solved Example 4.1 without any calculations other than finding the classical minimum test time of 2300 hours. Since a = 1 and b = 1000, we immediately know that the Bayesian test time is 2300 − 1000 = 1300 hours.

EXERCISE 14.1 Repeat Example 14.4, but this time assume the vendor testing yielded two failures in 1000 hours. Use a prior with a = 2 and b = 1000. Is the Bayesian test time still shorter than the corresponding classical test time of 2300 hours? EXERCISE 14.2 Using the general formula for the posterior PDF, show that a gamma prior for λ with parameters a and b does lead to a gamma posterior with parameters a = a + r ; b = b + T when r failures are observed in T test hours of a system (assuming an HPP model applies).

528



Applied Reliability

EXERCISE 14.3 a. Derive the formula for Bayesian test time given by T = 0.5M × 1G−1 (1 − α; a′, 0.5) − b. b. Show that when a = 1, the Bayesian test time is always the classical test time minus b. Hints: The identities in Appendix A.1 state that if Y has a G(y; r + 1, b ′) distribution, then 2b′Y has a χ 22(r+1) distribution and also G(y; r + 1, b′) = G(2b′y; r + 1, 0.5). The defining equation for the Bayesian test time is G(1/M; a + r, b + T) = 1 − α and, using the identities, this also equals G(2(b + T)/M; a + r, 0.5). Take the inverse function G−1 of both sides and solve for T. The factor for computing test lengths is .5χ 22(r+1);1−α (see Section 12.8 and the discussion before Table 12.8), and so to derive b, show that



G −1 (1 − α ; a , 0.5) = χ12−α ;2( a+1) EXERCISE 14.4 Find the value of the Bayesian prior a in Example 14.4 (with b = 1000) that would lead to a Bayesian test time equal to the classical test time. (Hint: Either use trial and error or Goal Seek.) EXERCISE 14.5 Again use the setup in Example 14.4, but assume no prior testing data are available. Instead, a panel of experts agree that they would bet even money on the system having an MTBF of 1500 hours and they believe there is only a 5% chance the MTBF is as low as 800 hours. Find the Bayesian parameters a and b that match this expert consensus and determine a minimum Bayesian test time needed to confirm an MTBF of 1000 hours at 95% confidence. What would the test time be if we want to allow one failure and still pass? Compare these test times to the corresponding classical test times.

The next example illustrates the kind of reasoning sometimes used to come up with Bayesian prior parameters. EXAMPLE 14.5  USING ENGINEERING JUDGMENT TO ARRIVE AT BAYESIAN PRIOR PARAMETERS Previous testing on a new tool had 11 failures in 1400 hours. However, 10 of these failures were due to the same poorly designed mechanical arm. A new arm assembly has replaced the old, and the new arm has a proven history of reliable operation with an MTBF in excess of 10,000 hours. It is desired to set up a qualification test for the redesigned tool making use of the old data. The goal is to confirm a tool MTBF of 500 hours at 80% confidence. Use this information to come up with reasonable Bayesian gamma prior parameters and compute the required Bayesian test time. Compare this to the corresponding classical test time.

Solution It makes sense to assume that, with the new arm, probably none of the mechanical arm failures previously seen would have occurred, but surely no more than one. So prior data of two failures in 1400 hours are a conservative estimate for the redesigned tool. The assumed prior data are 2 failures in 1400 hours, and the gamma prior, therefore, has parameters a = 2, b = 1400. If we want to allow up to two failures on the qualification test, we can use Excel to solve for T with the formula =0.5*500*GAMMAINV(0.8,4,2)−1400. This yields T = 1358 hours. An r of 1 would yield a T of 740 hours, and a minimum test time (r = 0) would produce a T of only 97 hours. The corresponding classical test times are 2140, 1500, and 805 hours.

529

Bayesian Reliability Evaluation

EXAMPLE 14.6  MTBF ESTIMATE AFTER TEST IS RUN Continuing Example 14.5, suppose you test for 1358 hours and get only one failure. Then, the posterior parameters are a = 2 + 1 = 3; b = 1400 +1358= 2758. You have confirmed, at 80% confidence, an MTBF of =1/GAMMAINV(.8,3,1/2758), which is 644.5 hours (somewhat better than the goal of 600 hours at 80% confidence).

Many of the examples shown resulted in a Bayesian test time that was shorter, sometimes much shorter, than the classical time. Although this will not always be the case, it is clear why Bayesian system reliability methodology can be very attractive when testing is expensive and/or there are time constraints. Before committing to a Bayesian methodology, however, review the comments in Section 14.2 and make sure the results of a Bayes system qualification analysis will be acceptable to all concerned parties.

14.4  Bayesian Estimation of the Binomial p Applications of the binomial distribution in quality control and reliability evaluations were described in Chapter 11. Recall that the general expression for the binomial distribution, which gives the probability of exactly x failures in n trials with probability of failure p per trial, is



$$P(X = x) = \binom{n}{x}\, p^{x} (1-p)^{n-x}, \qquad x = 0, 1, 2, \ldots, n.$$

In Chapter 11, the parameter p was considered to be a fixed, but unknown, number between 0 and 1. Following the Bayesian framework, we now consider p to be a random variable with a prior distribution. The natural choice for a prior is the beta distribution since, as we show next, this is the conjugate prior for the binomial distribution. For a random variable p having a (prior) beta distribution with parameters a > 0 and b > 0, the PDF in the unit interval is given by

$$g(p) = \frac{p^{a-1}(1-p)^{b-1}}{B(a,b)}$$

where B(a,b), in terms of gamma functions (see the discussion in Chapter 4 after Exercise 4.6), is

$$B(a,b) = \frac{\Gamma(a)\,\Gamma(b)}{\Gamma(a+b)}$$

For a and b integers,

$$B(a,b) = \frac{(a-1)!\,(b-1)!}{(a+b-1)!}.$$


The posterior distribution g(p|x), after observing x failures out of n trials, is given by

$$g(p \mid x) = \frac{\binom{n}{x} p^{x}(1-p)^{n-x}\,\dfrac{p^{a-1}(1-p)^{b-1}}{B(a,b)}}
{\displaystyle\int_0^1 \binom{n}{x} p^{x}(1-p)^{n-x}\,\frac{p^{a-1}(1-p)^{b-1}}{B(a,b)}\,dp}
= \frac{p^{a+x-1}(1-p)^{b+n-x-1}}{B(a+x,\; b+n-x)}.$$

The last expression is just the beta PDF with parameters a′ = a + x and b′ = b + n − x, showing that the beta is the conjugate prior associated with the binomial. We have also shown that the parameters of the posterior beta are the old a parameter increased by the number of observed failures x and the old b parameter increased by the number of nonfailures n − x. The mean and variance of the beta prior are

$$\text{mean} = \frac{a}{a+b}, \qquad \text{variance} = \frac{ab}{(a+b)^2(a+b+1)}$$

and the mean and variance of the posterior beta are

$$\text{mean} = \frac{a+x}{a+b+n}, \qquad \text{variance} = \frac{(a+x)(b+n-x)}{(a+b+n)^2(a+b+n+1)}$$

The Bayesian estimate of p is the mean of the posterior beta, and appropriate quantiles of the posterior beta distribution CDF give a 100 × (1 − α)% credibility interval for p. For example, using Excel spreadsheet functions and notation, a 90% credibility interval for p is given by

[BETAINV(0.05, a + x , b + n − x), BETAINV(0.95, a + x , b + n − x)]

These last equations show what to do when we have a beta prior for p, and we observe x failures out of a sample of n. What is left is the usual question that comes up when applying Bayesian methodology: How do we assign values for the prior parameters a and b? One common approach is to use a = 1 and b = 1, which is the same as picking a uniform distribution on the interval (0,1) for p as a prior, giving all possible values of p equal weight. Starting with this choice of a prior, the posterior beta has parameters a = 1 + x, b = n − x + 1. The next example shows how we can start with a uniform prior and then update that prior using past information. This step is followed by using new sample data to obtain a final posterior beta distribution. EXAMPLE 14.7  BAYESIAN ESTIMATION AND CREDIBILITY INTERVALS FOR P In Section 11.3, we considered the case where we sampled 100 units from a population of 50,000 integrated circuits and stressed them for a particular mode of failure. At the end of the experiment,


we observed four failures. The classical estimate for p, the population percent defective, was 4/100 or 4%, and a 90% confidence interval was shown to be from 1.38% to 8.92%. Now assume we have prior data based on a sample of 250 units taken from an earlier lot of these circuits and this sample yielded eight failures. How can we make use of this earlier data to obtain both a Bayesian estimate and a 90% credibility interval for p?

Solution We start with a uniform prior and update that prior using the sample data results from the earlier lot. The updated prior has parameters a = 1 + 8 = 9 and b = 250 − 8 + 1 = 243. The posterior beta after observing the results from the new sample of 100 would have parameters a = 9 + 4 = 13, b = 243 + 96 = 339. The final Bayesian estimate for p is 13/352 = 0.037, and a 90% credibility interval is (0.022, 0.055). The results are summarized in Table 14.2. Figure 14.5 shows graphically how the prior uniform density (a constant of 1 between 0 and 1) differs from the peaked posterior beta distributions that have most of their weight concentrated around the mean values of 0.036 and 0.037, respectively. The larger the sample size of the new data, the more peaked and concentrated the posterior will be.
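A short Python sketch (scipy's beta distribution in place of the BETAINV spreadsheet function; the running a, b variables are ours) reproduces the Example 14.7 updates:

```python
# Beta/binomial updating for Example 14.7.
from scipy.stats import beta

a, b = 1, 1                      # uniform prior on p
a, b = a + 8, b + 250 - 8        # update with the earlier lot: 8 fails in 250
a, b = a + 4, b + 100 - 4        # update with the new sample: 4 fails in 100
print(a, b)                      # 13, 339
print(round(a / (a + b), 3))     # posterior mean, about 0.037
print(beta.ppf(0.05, a, b), beta.ppf(0.95, a, b))   # roughly (0.022, 0.055)
```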

TABLE 14.2
Beta Distribution Parameters

                            Start    First Posterior    Second Posterior
a                             1             9                  13
b                             1           243                 339
p (mean)                                 3.57%               3.69%
Lower credibility limit                  1.88%               2.20%
Upper credibility limit                  5.68%               5.48%

FIGURE 14.5 Bayesian beta prior and posterior from Example 14.7. (The plot shows the beta density functions f(x) for x from 0 to 0.25: the flat uniform prior and the two peaked posterior densities, Posterior 1 and Posterior 2.)

An alternative approach to the Bayesian analysis shown in Example 14.7 might be to pool the two samples and do a classical analysis on 12 failures from a sample of 350 units.


That would give an estimate for p of 0.034 and a 90% interval of (0.020, 0.055). However, this interval would be a classical confidence interval rather than a probability statement about the location of p. If one makes the plausible assumption that every lot has its own value of p drawn from a prior distribution of p values and the two lots sampled each have their own value of p, then the Bayesian approach and results make intuitive sense. EXERCISE 14.6 Show that a beta with parameters a = 1 and b = 1 is a uniform distribution on (0,1). EXERCISE 14.7 A sample of 50 units per lot is drawn from successive lots of manufactured components and life tested at stress for 24 hours. The observed failures for the first 10 lots sampled were 3, 2, 4, 3, 1, 2, 4, 3, 1, and 2. The 11th lot sampled had five failures. Use beta/binomial Bayesian methodology, starting with a uniform prior for p and incorporating all the sampled lot data prior to the 11th lot, to estimate p using the 11th lot data as the new sample and compute a 90% credibility interval. Compare these results to a classical analysis using just the data from the 11th lot.

14.5  The Normal/Normal Conjugate Prior Another useful conjugate prior/posterior relationship occurs when the sampled data have a normal distribution N(μ, σ²) and μ is assumed to be random with a normal prior g(μ) that has known parameters μ0 and σ0². Here we also assume that σ² is known. After observing a complete sample x1, x2, … , xn with sample mean x̄ from the normal N(μ, σ²), the posterior distribution g(μ|x) also has a normal distribution with mean and variance given by



$$\mu_1 = \frac{\sigma^2 \mu_0 + n\bar{x}\,\sigma_0^2}{\sigma^2 + n\sigma_0^2}; \qquad
\sigma_1^2 = \frac{\sigma^2 \sigma_0^2}{\sigma^2 + n\sigma_0^2}
           = \frac{\sigma^2}{n}\cdot\frac{\sigma_0^2}{\dfrac{\sigma^2}{n} + \sigma_0^2}.$$

Note that the mean of the posterior is a weighted average of the prior mean and the sample mean, and the variance of the posterior is smaller than the variance of the sample mean x̄ (i.e., σ²/n) by a factor that approaches 1 as n increases. This property means that the Bayesian credibility intervals will always be shorter in width than the corresponding classical intervals. Since we can always analyze lognormal failure times by taking natural logarithms of all failure times and treating them as a sample from a normal distribution, the above Bayesian conjugate normal prior analysis can be applied to lognormal data. We will illustrate this technique in Example 14.8. EXAMPLE 14.8  BAYESIAN ESTIMATION AND CREDIBILITY INTERVALS FOR THE LOGNORMAL T50 A company is bringing up manufacturing of an integrated circuit component at a new plant. Previous high-stress life testing at other locations has shown that the component has a lognormal


distribution with T50 of 200 hours at stress with a sigma of 0.8. A sample of 10 units manufactured at the new plant is tested at the typical high stress until failure, resulting in failure times of 41.6, 71.8, 75.3, 86.6, 107.3, 120.1, 138.4, 141.8, 282.2, and 427.3 hours. Use the other location history to arrive at a normal prior for μ, the natural logarithm of the T50, and use the new sample data (assuming the sigma of 0.8 still holds) to obtain a posterior distribution. What is the Bayesian estimate for the T50? Give a 90% credibility interval for the T50. Does the interval include the old T50 of 200 hours?

Solution After taking natural logarithms of the 10 failure times, you have a complete normal sample with a mean of 4.783 and an assumed variance of 0.64. Based on the other location history, the prior for μ is taken as normal with mean μ0 = ln 200 = 5.298 and standard deviation σ0 = 0.8. Putting these values into the formula for the posterior mean results in μ1 = 4.83, and the Bayesian estimate for the T50 is e^4.83 = 125.2. The estimate for the posterior standard deviation σ1 is 0.24. A 90% credibility interval for this T50 is given (using spreadsheet functions) by



[EXP(NORMINV(0.05, 4.83, 0.24)), EXP(NORMINV(0.95, 4.83, 0.24))] This evaluates to (84.2, 186.2), which excludes the historical value of 200.
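A Python sketch of the same calculation is given below. It assumes the prior stated in the solution (μ0 = ln 200, σ0 = 0.8); the variable names are ours, not the book's.

```python
# Normal/normal conjugate updating for Example 14.8 (lognormal T50).
import numpy as np
from scipy.stats import norm

times = np.array([41.6, 71.8, 75.3, 86.6, 107.3, 120.1, 138.4, 141.8, 282.2, 427.3])
n, xbar = len(times), np.log(times).mean()          # n = 10, xbar ~ 4.783
mu0, var0, var = np.log(200), 0.8**2, 0.8**2         # prior mean/variance and known sigma^2

mu1 = (var * mu0 + n * xbar * var0) / (var + n * var0)      # posterior mean ~ 4.83
sd1 = np.sqrt(var * var0 / (var + n * var0))                 # posterior sd  ~ 0.24
print(np.exp(mu1))                                           # T50 estimate ~ 125
print(np.exp(norm.ppf([0.05, 0.95], mu1, sd1)))              # roughly (84, 186)
```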

Note that the examples considered in this section assumed that the variance parameter σ² for the sampled data was known. If both μ and σ² are random, we can use a joint conjugate prior distribution or a joint noninformative prior (see Martz and Waller 1982, section 9.2, or Hamada, Wilson, et al. 2008, section 2.5). EXERCISE 14.8 Create a single plot showing both prior and posterior normal distributions of Example 14.8. Comment on the differences between the two normal distributions.

14.6  Informative and Noninformative Priors The model one can specify as the Bayesian prior can result in a density that is spread out across the parameter space, giving almost equal weight to many different values of the parameter, or parameters, of interest, or the Bayesian prior can be heavily concentrated within a particular small region, giving little weight to parameter values outside that region. We call the spread-out prior a noninformative prior and the concentrated prior an informative prior. If you have little or no starting information about the parameter, it makes sense to use a noninformative prior and let the sample data control the shape of the posterior distribution. We already saw the use of a noninformative prior in Example 14.7 when we chose a uniform distribution that weights equally all values of p in the unit interval. We can also choose a noninformative prior when doing a Bayesian exponential distribution system MTBF analysis instead of the prior chosen in Example 14.1. Recall that the prior was for the failure rate λ, which can range over (0, ∞). If we set g(λ) = 1, then we achieve our goal of weighting all values of λ equally. However, the


density g(λ) is not a probability density since it does not have a finite integral over (0, ∞). Does that rule out using it for a prior and applying Bayes’ formula to obtain a posterior? It turns out that it often does not matter if we start with what is called an improper prior. What matters is whether the key Bayesian equation

$$g(\lambda \mid x) = \frac{f(x \mid \lambda)\,g(\lambda)}{\displaystyle\int_0^\infty f(x \mid \lambda)\,g(\lambda)\,d\lambda}$$

results in a proper probability distribution. When f (x|λ) is an exponential likelihood for r observed failure times and the prior g(λ) is a constant for all positive λ, the posterior distribution g(λ|x) turns out to be a gamma with parameters a = r + 1 and b = the sum of the r exponential failure times. In other words, the improper noninformative prior acts just like an improper gamma conjugate prior with parameters a = 1 and b = 0 (see Example 14.1).

EXAMPLE 14.9  USING AN IMPROPER NONINFORMATIVE PRIOR FOR EXPONENTIAL FAIL TIMES A new system has its first failure at 100 hours. Using a noninformative constant prior for the repair rate λ, find the Bayesian estimate of the MTBF, assuming an HPP model applies. The third repair takes place at 200 hours and the ninth repair takes place at 500 hours. Also, compute Bayes estimates for the MTBF at both these points in time and calculate a 90% credibility interval for these three MTBF estimates.

Solution The posterior distribution for λ given one repair at 100 hours is a gamma with a = 1 + 1 = 2 and b = 100. The mean is 2/100 = 0.02 and therefore the Bayesian estimate of the MTBF is 1/0.02 = 50 hours. At 200 hours and three repairs, the posterior has parameters a = 1 + 3 = 4 and b = 200, again with a mean estimate of 4/200 = 0.02 and an MTBF estimate of 50 hours. At 500 hours and nine repairs, the posterior has parameters a = 1 + 9 = 10 and b = 500, again with an MTBF estimate of 50 hours. The 90% credibility intervals using spreadsheet functions are

at 100 hours: [1/GAMMAINV(0.95, 2, 1/100), 1/GAMMAINV(0.05, 2, 1/100)] = (21.1, 281.4)
at 200 hours: [1/GAMMAINV(0.95, 4, 1/200), 1/GAMMAINV(0.05, 4, 1/200)] = (25.8, 146.4)
at 500 hours: [1/GAMMAINV(0.95, 10, 1/500), 1/GAMMAINV(0.05, 10, 1/500)] = (31.8, 92.2)
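The three intervals can also be obtained with a short Python sketch (scipy's gamma.ppf in place of GAMMAINV; the loop variables are ours):

```python
# Example 14.9: with a constant improper prior, r repairs in T hours give a
# gamma(r + 1, T) posterior for lambda; MTBF limits are reciprocals of lambda limits.
from scipy.stats import gamma

for repairs, hours in [(1, 100), (3, 200), (9, 500)]:
    a, b = repairs + 1, hours
    lo = 1 / gamma.ppf(0.95, a, scale=1 / b)   # lower MTBF limit
    hi = 1 / gamma.ppf(0.05, a, scale=1 / b)   # upper MTBF limit
    print(b / a, round(lo, 1), round(hi, 1))
# MTBF estimate 50 in each case; intervals about (21.1, 281.4), (25.8, 146.4), (31.8, 92.2)
```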

A look at a plot of the prior and the three posterior densities from Example 14.9 illustrates how a constant prior that gives equal weight to all values of λ transforms into a posterior that becomes increasingly concentrated around the mean the more sample data points there are. Figure 14.6 shows this, with the prior set at the constant 10 for easy visibility (any value leads to the same posterior distributions). We have seen examples of the noninformative prior for both exponential data (a constant prior on the positive axis) and binomial data (a constant of 1 on the unit interval). We can also use a noninformative prior for the mean when we have normal sample data


FIGURE 14.6 Prior and posterior densities from Example 14.9. (The plot shows, as PDF f(x) for x from 0 to 0.06, the improper constant prior together with the gamma posteriors Gamma(a = 2, b = 100), Gamma(a = 4, b = 200), and Gamma(a = 10, b = 500), each with mean 0.02.)

as in Section 14.5. Again, we give equal weight to all possible values of μ, which now span the entire x-axis. When one starts with this noninformative prior for μ and has a complete sample x1, x2, … , xn that has distribution N(μ, σ²) with σ² assumed known, the resulting posterior distribution is N(x̄, σ²/n). Estimates and intervals obtained using this posterior turn out to be the same numerically as those obtained using classical methods. Because of the Bayesian approach, however, a confidence interval (credibility interval) for μ is an actual probability statement about the location of μ. There are other ways of defining noninformative priors based on sophisticated mathematical properties rather than on the simple "equal weights" criteria used here. The interested reader can find discussions of these in Box and Tiao (1973) or Martz and Waller (1982).

EXERCISE 14.9 Verify that a constant prior for λ combined with complete sample data of r exponential failure times t1, t2, … , tr leads to a gamma posterior with a = r + 1 and $b = \sum_{i=1}^{r} t_i$.

Hint: Start with the key Bayesian equation for the posterior, substitute the exponential likelihood and a constant prior for f(t|λ) and g(λ), and then solve the integral in the denominator using iterated integration by parts. It may be helpful to first derive the result for one failure time (r = 1) before tackling the general result.
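As a sketch of the r = 1 case suggested in the hint (our calculation, using the exponential likelihood f(t1|λ) = λe^(−λt1) and the constant prior g(λ) = 1):

$$g(\lambda \mid t_1) = \frac{\lambda e^{-\lambda t_1}\cdot 1}{\displaystyle\int_0^\infty \lambda e^{-\lambda t_1}\, d\lambda}
= \frac{\lambda e^{-\lambda t_1}}{1/t_1^{2}} = t_1^{2}\,\lambda\, e^{-\lambda t_1},$$

which is the gamma PDF with a = 2 and b = t1, matching the general result for r = 1. (One integration by parts gives the denominator $\int_0^\infty \lambda e^{-\lambda t_1}\,d\lambda = 1/t_1^{2}$.)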

EXERCISE 14.10 Use the final system data in Example 14.9 (nine repairs, the last occurring at exactly 500 hours) to calculate a classical 90% confidence bound for the MTBF (using Tables 12.5 and 12.7). How does this interval compare to the Bayesian credibility interval of (31.8, 92.2) obtained using a noninformative prior?


14.7  A Survey of More Advanced Bayesian Methods The Bayesian methods described so far used a completely specified prior distribution and a current set of sample data to derive a posterior distribution for a parameter of interest. The prior was based upon past information and/or engineering judgment in some of the applications. In other examples, we chose a diffuse (i.e., noninformative) prior so as to more or less allow the current data to dominate. In all the examples, the mathematical form of the prior was such that the basic Bayesian posterior distribution equation could be solved in closed form. Conjugate priors met the last requirement, explaining their popularity and widespread use. In addition to the simple, but useful, cases we have described, there are many examples in the literature where analysts have used prior probability models that made engineering sense but led to complex integrals that could not be solved except by numerical methods or simulation studies. This task is much easier to do today because of the wide availability of computing power and Bayesian analysis software packages.

An extension of the Bayesian framework called hierarchical Bayes also leads to more complicated models and analysis techniques. Hierarchical Bayesian models go one or more steps further than the models we have considered by treating the parameters of the prior model as themselves random and having a second stage hyperprior containing hyperparameters. These hyperparameters have to first be integrated out to arrive at the (marginal) probability distribution of the first stage prior. These models can be carried even further, treating the hyperparameters as random and so on, with increasingly more difficult equations to solve.

All the Bayesian methods described so far depend heavily upon both the explicit choice of a prior and the values assumed for the prior parameters (or the hyperparameters, at the final stage in the hierarchical Bayes approach). Empirical Bayes methods attempt to eliminate this dependence upon the choice of either a prior model or the prior model parameter specifications. Robbins (1956, 1964) is credited with introducing the term "empirical Bayes." He described a procedure whereby current data could be used to not only derive a posterior distribution but also come up with estimates for the parameters of the prior distribution. Empirical Bayes methods have been developed that do not even assume a particular form or model for the prior distribution. They use current and past data and simulation techniques to derive estimates of the posterior distribution. This approach is called "nonparametric empirical Bayes." Methods like this, of course, rely heavily on current computer capabilities and specialized software. A criticism of empirical Bayes is that it no longer attempts to introduce prior knowledge or judgment to augment the information contained in current samples. Thus, it departs from the original intent of Bayesian analysis. Carlin and Louis (1996) bypass philosophical questions about empirical Bayes and demonstrate how it provides a useful set of tools for analyzing and understanding data.

The most significant development in Bayesian analysis over the last two decades is the implementation of Markov chain Monte Carlo (MCMC) methods and software. When dealing with multidimensional parameter spaces and complicated, possibly hierarchical, models, deriving a closed-form expression for the posterior distribution of the parameter, or parameters, of interest is often impossible.
Even deriving expressions that can be numerically integrated is likely to be too difficult. The only viable alternative is to simulate values from the posterior distribution and compute percentiles of interest or density estimates from these simulated data. This situation also holds for most analyses where the sampled data are censored.


MCMC simulation is a procedure that generates approximate samples from the posterior distribution g(θ|x), where θ is a random parameter vector and x is a vector of new observations from the conditional distribution f(x|θ). The simulation starts with an initial selection of a starting value for θ, which we will call θ1. We then generate the sequence θ2, θ3, θ4, … , θk, with the property that the conditional density of θk, given all the preceding θi, only depends on the last simulated value of θ, which was θk−1. In other words, f(θk | θk−1, θk−2, … , θ1) = f(θk | θk−1) for all k. This last equation is the Markov property, and the sequence of generated θs form a Markov chain. In order to use MCMC simulation, we need three things:

1. A way of using the Bayesian model and the observed data x to come up with the first simulated value θ1.
2. An algorithm that generates θk after the value θk−1 has been observed.
3. Software that can automate all the calculations required in steps 1 and 2.

There are two widely used methods for carrying out steps 1 and 2. These are the Metropolis–Hastings algorithms and the Gibbs samplers. Interested readers can find out details of these algorithms in Hamada et al. (2008), Carlin and Louis (1996), and Casella and George (1992). The third necessity, good Bayesian simulation software, is available in a package called BUGS (Bayesian inference Using Gibbs Sampling). BUGS can be downloaded for free in a package called WinBUGS or in an open-source package called OpenBUGS. JAGS (Just Another Gibbs Sampler) is another free program for analysis of Bayesian hierarchical models using MCMC simulation (see Ntzoufras 2009 for detailed instruction and examples showing how to do Bayesian modeling using WinBUGS). The Markov sequence of parameter vectors generated by a program such as WinBUGS, θ1, θ2, θ3, θ4, … , θk, will typically converge to a stationary Markov distribution that is the posterior distribution g(θ|x) we are after. The user of the program will have to check that convergence is happening and pick a large enough value of k such that all θi with i > k can be considered to be an approximately random sample from g(θ|x).
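To make the idea concrete, here is a toy random-walk Metropolis sampler in Python. It is only a sketch (not the WinBUGS/JAGS machinery referred to above), and the failure times are made up: it draws from the posterior of an exponential failure rate λ given complete failure times and a gamma(a, b) prior, a case where the conjugate answer is known and can be used to check the simulation.

```python
# Toy Metropolis sampler for lambda | data, with a gamma(a, b) prior.
import numpy as np

rng = np.random.default_rng(1)
times = np.array([55.0, 187.0, 216.0, 240.0, 368.0])   # made-up failure times
a_prior, b_prior = 1.0, 0.0                             # improper constant prior (a = 1, b = 0)

def log_post(lam):
    if lam <= 0:
        return -np.inf
    loglik = len(times) * np.log(lam) - lam * times.sum()       # exponential likelihood
    logprior = (a_prior - 1) * np.log(lam) - b_prior * lam      # gamma prior kernel
    return loglik + logprior

chain, lam = [], 0.01                    # crude starting value
for _ in range(20000):
    prop = lam + rng.normal(0, 0.002)    # symmetric random-walk proposal
    if np.log(rng.uniform()) < log_post(prop) - log_post(lam):
        lam = prop                       # accept the proposal
    chain.append(lam)
burned = np.array(chain[5000:])          # drop burn-in, keep the rest
print(burned.mean())                     # compare with the exact conjugate posterior mean:
print((a_prior + len(times)) / (b_prior + times.sum()))
```

Because the prior is conjugate here, the simulated mean should agree closely with the closed-form gamma posterior mean; in real applications MCMC is used precisely when no such closed form exists.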

14.8  Summary A Bayesian approach regards parameters considered fixed in classical analysis as random variables, with a prior distribution model assigning probability weight to regions in the parameter space. After observing sample data with a known likelihood function given the parameters, Bayes’ formula tells us how to calculate the posterior parameter distribution. With the posterior distribution known, we can estimate mean or median values for the parameters and calculate true confidence bounds for these parameters. The examples shown in detail in the chapter illustrate ways to use old data and/or engineering judgment to come up with prior parameters. Conjugate distributions were used for exponential, binomial, and normal/lognormal data since these distributions simplified calculations and were flexible and reasonable choices. We also showed how to use noninformative priors for these same sample distributions when little was known in advance about the prior parameters. In order to use conjugate distributions or these noninformative priors, however, we had to restrict ourselves to the case where the sample observations were complete samples.


These Bayesian applications demonstrated that often a Bayesian approach can yield more precise estimates or qualify systems with shorter test times. The savings can be significant when either past data or an engineering judgment consensus leads to an encouraging prior model. More recent Bayesian approaches, such as hierarchical Bayes and empirical Bayes techniques, can handle much more complicated models but generally require MCMC simulations and specialized software such as WinBUGS.

Appendix 14A 14.1A  Gamma and Chi-Square Distribution Relationships There are some very useful relationships between the gamma and the chi-square distribution that we may take advantage of when planning system reliability confirmation tests using Bayesian methodology. If G(y; a, b) denotes the gamma CDF with parameters a, b as defined previously in Section 14.1, then the chi-square random variable with 2a degrees of freedom, χ²(2a), has a gamma distribution with CDF G(y; a, 0.5). Thus, in Excel, 1-CHIDIST(y, 2*a) and GAMMADIST(y, a, 2, 1) will return the same CDF values. Also, if Y has a gamma distribution with CDF G(y; a, b), then 2bY has a gamma distribution with CDF G(y; a, 0.5), which again is a χ²(2a) random variable. In addition, we have that G(y; a, b) = G(2by; a, 0.5). We can write these identities in Excel notation (recalling that Excel uses β = 1/b as the scale parameter) as follows:

{GAMMADIST(y, a, 2, 1)} = {1 − CHIDIST(y, 2*a)} = {GAMMADIST(y/(2*b), a, 1/b, 1)} and

{GAMMADIST(y, a, 1/b, 1)} = {GAMMADIST(2*b*y, a, 2, 1)}

The equals signs between the bracketed Excel expressions mean that each expression returns the same value no matter what choice of y, a, or b is used. We need to use {1 − CHIDIST} instead of CHIDIST because Excel returns the upper tail for CHIDIST, which is 1 minus the CDF.
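The same identities can be spot-checked numerically in Python (scipy in place of Excel); the test values below are arbitrary choices of ours:

```python
# Numerical check of the gamma/chi-square identities of Appendix 14A:
# G(y; a, 0.5) is the chi-square CDF with 2a df, and G(y; a, b) = G(2*b*y; a, 0.5).
from scipy.stats import chi2, gamma

y, a, b = 3.7, 2.5, 0.4                       # arbitrary test values
print(gamma.cdf(y, a, scale=2), chi2.cdf(y, 2 * a))                     # equal
print(gamma.cdf(y, a, scale=1 / b), gamma.cdf(2 * b * y, a, scale=2))   # equal
```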

Problems

14.1 The consensus on a new tool is that the MTBF is equally likely to be above 500 hours as below. It is also considered highly unlikely that the MTBF will be as low as 100 hours. The value 100 is considered to be a 95% lower limit for the MTBF prior to testing.
14.2 You wish to confirm an MTBF of 400 hours at 90% confidence. Assume you want to allow up to two failures and still "pass" the qualification test.
     a. How long must you test assuming a classical analysis? If you run this test and have only one failure, what MTBF have you confirmed at 90% confidence?


     b. Prototype testing that engineering judgment accepts as relevant gave only three failures in 1350 hours. Use these data to determine a new (Bayesian) test time that will allow up to two new failures and still confirm a 400-hour MTBF at 90% confidence. If the new test only has one failure, what MTBF have you actually confirmed?
     c. Repeat the analysis in part b, assuming there is an engineering consensus on a median (50%) estimate for the MTBF of 450 hours and a "worst-case" (95%) estimate of 150 hours.
14.3 Assume the test in Example 14.4 is run for the minimum Bayesian test time of 1303 hours, but instead of zero failures, one failure occurs. What is the new failure rate estimate and what is an upper 95% bound for the failure rate? How long would you have planned the test to run if you wanted to allow one failure and still be 90% confident of meeting the MTBF specification of 1000 hours?


Answers to Selected Exercises

Chapter 1





1.1 The Excel spreadsheet Table1.1.xls (on the publisher’s website at http:// www.crcpress.com/product/isbn/9781584884668) has the Table 1.1 measurements in Column A, the interval boundaries in Column B, and the matrix function FREQUENCY set up to produce the frequency table shown in Table 1.2 in Column C. 1.2 Column D of Table1.1.xls has the new cell boundaries, and the new frequency numbers for the six intervals are calculated in Column E. They are 4, 23, 27, 31, 14, and 1. 1.6 There are 2 black aces out of 52 cards, so P(AB) = 2/52. But P(A) = 4/52 and P(B) = 1/2, so P(AB) = P(A)P(B) and the events are independent. 1.7 0.28, 0.41, 0.31. 1.8 0.873 (assuming independence and all components must survive for the electronic card to survive). 1.10 1, u, 0.5, and 1/12 for the density, CDF, mean, and variance, respectively. 1.11 f (t) = (1 + t)−2, T50 = 1, and the mean does not exist ( = ∞). 1.12 The mean of a β = a/(a + b). For this to be 0.5, a must equal b.

Chapter 2





2.1 a. 0.3935, 0.7062 b. 0.3127 c. 0.2057 d. 78.7, 62.54 2.2 0.281. 2.4 F(t) = 1 − e^(−λt). 2.5 a. Could only be a hazard function b. Could only be a PDF (if c = 1) c. Could only be a hazard function d. Could only be a hazard function e. Could be a PDF or a hazard function 2.6 353.553%/K and 3,535,530 FITs, 111.803%/K and 1,118,030 FITs, 35.355%/K and 353,550 FITs, 11.18%/K and 111,800 FITs, 3.356%/K and 33,560 FITs.


2.7 39.2%/K. 2.8 0.74, 0.855, 1.06, 1.70, 3.33, 6.24, 9.44, and 10 (all in percentage per year). 2.9 h(t) = a/(b + t), MTTF = b/(a − 1), ResidualMTTF(T0) = (b + T0)/(a − 1); for a = 2, MTTF = b and ResidualMTTF(T0) = b + T0. 2.10 There is no aging when the exponential distribution applies.

Chapter 3





3.2 2%/K, 5,268 hours, 34,657 hours.
3.3 0.24, 0.24.
3.5 λ = 0.007421.
3.6 λ = 0.007829.
3.7 0.736.
3.8 b. λ = 99.38%/K, MTTF = 1006.26 hours. c. 690.7, 688. d. The chi-square statistic is 3.47. When compared to a table value (90% level, 3 df) of 6.25, we cannot reject an exponential model fit. e. 1068 hours, 1173.8 hours.
3.10 Both JMP and Minitab estimate λ = 135.63. JMP calculates a KSL D value of .086558, which has a p value of > .15. Minitab calculates an AD value of .93, which has a p value of .142. In both analyses, the exponential model cannot be rejected.
3.11 (41,584) in FITs.
3.12 2773 FITs, 1386 FITs.
3.13 0.0013, 0.50 using the 50% estimate and 0.00277, 0.025 using the 75% estimate.
3.14 5324, 3106, 2023.
3.15 22,554 hours, 6752 hours.
3.16 380.
3.17 11,513 hours, 4582 hours.
3.18 2940 hours, (1940 hours, 4106 hours).
3.1A 5320 (5323), 3108 (3106), 2024 (2023).
3.2A 22,532 hours (22,554 hours), 6,756 hours (6,752 hours).

Chapter 4

4.1 518 hours. 4.2 3.1%.


4.3 11,157 hours. 4.4 1.10. 4.5 2.52 × 105 hours. 4.6 37%. 4.7 2000 hours, 480 hours, 4472 hours. 4.8 α = 625 hours, β = 0.5, MTTF = 1250 hours, median life = 300 hours, Rs(100) = exp(−0.4) = 0.67. 4.9 Assuming the Rayleigh distribution: 44 to 45 hits; for a uniform distribution: 24 to 25 hits. 4.10 Oven 1: ln α MLE is 6.497, β MLE is 1.954, and L = 165.359; oven 2: ln α MLE is 6.560, β MLE is 1.966, and L = 33.559; oven 3: ln α MLE is 6.798, β MLE is 2.082, and L = 30.508; all cells combined: ln α MLE is 6.601, β MLE is 2.032, and L = 231.054. 4.11 (2813.6, 12023.2). 4.12 90% interval for α is (660.45, 826.7); 90% interval for β is (1.66. 2.45). 4.14 Weibull fit has α MLE of 120.83 and β MLE of 1.86. Both JMP and Minitab come up with goodness of fit p values greater than or equal to 0.25, so the Weibull model is not rejected. 4.3A (2,784, 12,452) is a 95% interval for α. 4.4A (1.66, 2.45) is a 90% interval for β and (660, 827) is a 90% interval for α.

Chapter 5 5.1 0.00199, 0.02275, 7.36 oz. 5.2 32, 6, 0.09121, 0.05692. 5.3 11.25 μm, 9.75 μm. 5.6 μ MLE is 11.9634 and σ MLE is 1.01268. 5.7 JMP should find the same MLEs as Minitab did in Exercise 5.6. 5.9 0.0216 or 2.16%. 5.10 37.5%, 60.3%. 5.11 59,972 hours, 334.97 hours, h(1,000) = 99.7 FITs, h(10,000) = 14,430 FITs and h(40,000) = 22,160 FITs. 5.13 For exact times, μ MLE is 8.0419, σ MLE is 2.5481, and L = 227.153. For interval data, μ MLE is 7.9542, σ MLE is 2.2007, and L = 90.4525 (this time the lognormal L value is 0.962 better than the Weibull L value, showing that both models match the data equally well). 5.14 Oven 1: μ MLE is 6.210, σ MLE is 0.686, and L = 166.847; oven 2: μ MLE is 6.296, σ  MLE is 0.617, and L = 33.262; oven 3: μ MLE is 6.629, σ MLE is 0.689, and L = 29.710; all cells combined: μ MLE is 6.360, σ MLE is 0.670, and L = 232.118.


5.15 μ MLE is 10.292, σ MLE is 1.5036, and L = 370.0582. The lognormal analysis of the truncated data has a practically indistinguishable smaller L (for the Weibull, L was 370.0607—see Figure 4.16). 5.15 1132 hours. 5.16 13.5%. 5.17 38,245 hours. 5.18 σ = 1.46. 5.19 T50 = 50.1 × 106 hours.

Chapter 6



6.1 m = −A/B, b = −C/B; possible solutions are A = k, B = −k/m, and C = kb/m, where k is any constant. 6.2 131°F. 6.3 The yˆ values corresponding to x = 1, 2, 3, 4, and 5 are 3.06, 4.03, 5, 5.97, and 6.94, respectively. 6.4 y = a + bS, where y = ln Y; plot ln Y versus S, B estimate is 2T(slope), A estimate is exp(2(intercept)). 6.5 A = 1.02, B = 418 Kelvin (using least squares). 6.6 MTTF = 182.8 hours. 6.7 MTTF = 187.0 hours, threshold = 7.42 hours. 6.8 MTTF = 29.1 hours (line through origin). 6.9 At t = 1, 10, 96 hours, LCL = 0.005, 0.30, and 0.74, respectively. UCL = 0.26, 0.78, and 0.995, respectively. 6.10 MTTF = 174.7 hours (line through origin); 90% confidence limits: at t = 168, LS estimate F = 0.62, LCL = 0.50, UCL = 0.82; at F = 0.65, LS estimate t = 183 hours, LCL = 122 hours, UCL = 302 hours. 6.12 β estimate = 6.053, α estimate = 52.76. 6.13 79.3. 6.5A Median of max = 204.4, CDF = 0.871, median rank = 0.871. 6.6A Median of min = 96.6, CDF = 0.129, median rank = 0.129.

Chapter 7

7.3 The ordered failure times are 105, 512, 2417, 3250, 5997, and 7012 hours. The corresponding KM CDF estimates are 0.0022, 0.0044, 0.0078, 0.0111, 0.0145, and 0.0245.


Chapter 8

8.1 fu(t) = dFu(t)/dt = dFS(t/AF)/dt = (1/AF) fS(t/AF);
    hu(t) = fu(t)/[1 − Fu(t)] = (1/AF) fS(t/AF)/[1 − FS(t/AF)] = (1/AF) hS(t/AF).
8.2 tu = AF tS²; Fu(t) = FS(√(t/AF)); fu(t) = (1/(2√(t AF))) fS(√(t/AF)); hu(t) = (1/(2√(t AF))) hS(√(t/AF)). When tS AF is less than 1, the time to fail at use is less than the time to fail at stress (which does not make physical sense if stress truly "accelerates" failure times). The crossover time is tS = 1/AF, after which times to fail at use are later than times to fail at stress.
8.3 tu = AF e^(tS); Fu(t) = FS(ln(t/AF)); fu(t) = (1/t) fS(ln(t/AF)); hu(t) = (1/t) hS(ln(t/AF)).
8.4 Fu(t) = 1 − e^(−λS t/AF); hu(t) = λS/AF.

8.6 84.34. 8.7 When β is less than 1, the acceleration will be less than 100X. 8.8 The 125°C cell had more failures than the other two cells combined, and therefore it carries more “weight” when estimating parameters. 8.9 βˆ   = 1.12, αˆ = 5862, λˆ = 17.67 %/K , and the chi-square test statistic has 50.9% sig-

nificance, indicating that there is no reason to reject an exponential model. 8.10 The common shape MLE is 1.55 but the hypothesis of equal shapes is rejected at 99.9% confidence (chi-square = 13.35 with 1 df). 8.11 The chi-square test statistic is 2 × (231.054 − 229.426) = 3.256 with 6 − 2 = 4 df. The significance level (p value) is 0.516, indicating there is no reason to reject the hypothesis that all three samples come from the same population. 8.12 The 95% intervals are (4,974 ≤ α85 ≤ 43,434); (1,100 ≤ α105 ≤ 3,481); (143 ≤ α125 ≤ 365).






8.13 tu = AF tS²; Fu(t) = FS(√(t/AF)); fu(t) = (1/(2√(t AF))) fS(√(t/AF)); hu(t) = (1/(2√(t AF))) hS(√(t/AF)). When tS AF is less than 1, the time to fail at use is less than the time to fail at stress. The crossover time is tS = 1/AF, after which times to fail at use are later than times to fail at stress.
8.14 The negative log likelihood for the lognormal (same sigma) fit is 412.749, whereas the negative log likelihood for the Weibull fit (same β) is 425.492. This indicates a "better" lognormal fit, but no significance level can be attached to that statement.
8.18 A 25.5-hour burn-in is needed. AFR(720) = 28.2 FITs after a 1-hour burn-in.
8.19 Slope estimate using the 200, 400, 600, and 720 hour points is 0.998. Slope estimate using the 100,200, 100,400, 100,600, and 100,720 hour points has dropped to 0.88.


Chapter 9

9.5 LIK = p r



F(Tj − Tj−1 )

rj

1 − pF(Tk )

n− r

j=1

9.7 The MLEs for the fraction mortal, μ, and σ are 0.2482, 7.288, and 0.489, respectively. The likelihood ratio test does not, however, reject the hypothesis that all the population is mortal.

Chapter 10
10.1 The board failure rate is 6892 FITs, and the reliability at 40,000 hours is 0.725.
10.2 10,018 FITs and 0.697.
10.3 1 FIT.
10.4 0.999925.
10.5 0.9850, 0.7351.
10.6 54,167 hours.
10.7 (1 − F1F2F3)(1 − F4)(1 − F5)(1 − F6)(1 − F7F8)[6(1 − F)^2 F^2 + 4(1 − F)^3 F + (1 − F)^4]
10.8 FE[1 − (1 − RARB)(1 − RCRD)(1 − RFRG)] + RE(1 − FAFEFF)(1 − FBFDFG)

Chapter 11 11.1 34,650 11.2 10 11.3 n Cr = ( nr ) = 11.4 11.7 11.8 11.9 11.10 11.11 11.12 11.13 11.14 11.17 11.18

[n(n − 1) × ⋯ × (n − r + 1)] / [r(r − 1) × ⋯ × 1].

0.0769, 2.5, 0.9231 1, 0.866, 0.316, 0.422, 0.039, 0.684 0.9995 10.0% to 33.7%, No 8.9% to 31.1%, off by ~ 10% 46.2 PPM, exact (10, 135) PPM, normal (−6.1, 98.4) PPM (0.111, 0.333) (15.7, 135.7) PPM 0.99465 = exact answer 17/33 Normal approximation: 0.03285; Fisher’s ET: 0.09253


11.19 Probabilities (Group 2 fails): 0.09253 (0); 0.3266 (1); 0.3810 (2); 0.1738 (3); 0.02607 (4) 11.21 Binomial: 0.28575; Poisson: 0.28730 11.22 1. 2, 2. 20, 3. 0.1353, 4. 0.8647, 5. 0.0361, 6. 2E-9, 7. Poisson 0.2366, binomial 0.2575 11.23 0.3717 11.24 0.8694 (0), 0.1217 (1), 0.0085 (2), 0.000398 (3); combined λ = 0.15; confidence interval (17.8, 47.2) 11.25 23, (15.7, 32.6) 11.30 Minimum sample size is 158, c = 4 11.31 Minimum sample size is 590, c = 12 11.34 c = 0, n = 103; c = 1, n = 175 11.35 c = 0, AQL = 0.050%; c = 1, AQL = 0.20% 11.36 AQL = 0.16%; RQL = 1.06% 11.37 RQL = 1.06%: c = 0, AQL = 0.024%; c = 1, AQL = 0.097% 11.38 AQL = 0.46%: c = 0, n = 12, RQL = 17.5%; c = 1, n = 79, RQL = 4.8%; c = 2, n = 180, RQL = 2.9% 11.39 11,513 11.40 23,717 11.41 n = 783, c = 7 11.42 n = 132, c = 3 11.43 90% control limits: 2565 to 149785; 50,000; 34,657

Chapter 12 12.6 T = 4000 is the required test time. 12.7 The new failure rate estimate is 2/2305 = 0.000868 fails/h. A 95% upper bound is 0.00206 fails/h. T = 2890 hours if we want to succeed despite one failure. 12.9 0.98807. 12.10 2. 12.11 2.4, 15 hours, 0.670. 12.14 0.01, [0.0050, 0.0164]. 12.15 4.4, 2000 hours, 4.4. 12.4A 0.1802.

Chapter 13 13.3 Significance levels 29%, 15%, 1%. 13.5 6. 13.7 Partial answers: 24 permutations; respective cumulative probabilities of 0.042, 0.167, 0.375, 0.625, 0.833, 0.958, 1.000 for 0, 1, 2, 3, 4, 5, 6 reversals.


13.8 120, Pr(R less than or equal to 1) = 0. 042, Pr(R greater than 8) = 0.042, Pr(R less than 9) = 95.8%. 13.10 50%, 0.002%, 0.004%. 13.11 150, shorter, 0.81%. 13.12 69, 36. 13.14 5.1, 0.0060, 0.963. 13.23 The MLE for MTBF(1400) is 159 hours, whereas the estimate based on an unbiased slope is 173.4 hours. 13.24 The MLE of the improvement slope is 0.35, and the modified estimate is 0.40. The FIT is reasonable, and a 95% interval for MTBF(650) is (34.8, 207.6). 13.25 The MLE of the improvement slope and the MTBF(320) are 0.455 and 97.8, respectively. The corresponding modified estimates are 0.546 and 117.4. The FIT, however, is rejected by the Q test at a greater than 99% level. 13.27 β = 0.333, T = 512 hours. 13.28 22.36 × improvement.

Chapter 14 14.1 Bayes test time = 2890 hours, which is larger than the classical test time. 14.4 a = 1.6134. 14.5 a = 5.77, b = 8159, and the minimum test time is 2077 hours. With r = 1, the test time is 3380 hours. The corresponding classical test times are 3000 hours and 4740 hours. 14.7 a = 25 + 1 = 26, b = 500 − 25 + 1 = 476. a = 26 + 5 = 31, b = 476 + 50 − 5 = 521. The estimate of p is the mean of the posterior distribution or 31/552 = 0.056. A 90% Bayes­ ian credibility interval is given by [BETAINV(.05,31,521), BETAINV(.95,31,521)] or (0.041, 0.073). The corresponding classical confidence interval using just the lot 11 data would be [BETAINV(.05,5,46), BETAINV(.95,6,45)] or (0.04, 0.20). 14.10 The classical interval is (34.6, 106.5) which is slightly wider than the Bayesian interval.


References Aalen, O. O. 1978. Nonparametric inference for a family of counting processes. Ann Stat 6:534–45. Abernethy, R. B. ed. 1996. The New Weibull handbook. 2nd ed. North Palm Beach, FL: Robert B. Abernethy. Abramovitz, M., and I. A. Stegun, eds. 1964. Handbook of Mathematical Functions. Washington, DC: National Bureau of Standards. Agresti, A., and B. A. Coull. 1998. Approximate is better than exact for interval estimation of binomial parameters. Am Stat 52:119–26. Akritas, M. G. 1986. Bootstrapping the Kaplan-Meier estimator. J Am Stat Assoc 81:1032–8. Ascher, H. 1981. Weibull distribution vs. Weibull process. In 1981 Proceedings Annual Reliability and Maintainability Symposium, Philadelphia, PA, 426–31. Ascher, H., and H. Feingold. 1984. Repairable Systems Reliability. New York: Marcel Dekker. Bain, L. J., and M. Englehardt. 1991. Statistical Analysis of Reliability and Life-Testing Models: Theory and Methods. 2nd ed. New York: Marcel Dekker. Baird, D. C. 1995. Experimentation, An Introduction to Measurement Theory and Experiment Design. 3rd ed. Englewood Cliffs, NJ: Prentice Hall. Barlow, R. E., and F. Proschan. 1975. Statistical Theory of Reliability and Life Testing. New York: Holt, Rinehart and Winston. Barlow, R. E., F. Proschan, and E. Scheuer. 1969. A system debugging model. In Operations Research and Reliability, ed. D. Grouchko, 401–20. New York: Gordon and Breach. Bates, G. E. 1955. Joint distributions of time intervals for the occurrence of successive accidents in a generalized polya scheme. Ann Math Stat 26:705–20. Berk, K. N., and P. Carey. 2004. Data Analysis with Microsoft Excel. Belmont, CA: Brookes/Cole/ Thomson Learning. Bernard, A., and E. C. Bos-Levenbach. 1953. The plotting of observations on probability paper. Statistica 7:163–73. Beyer, W. H., ed. 1968. Handbook of Tables for Probability and Statistics, Second Edition. Cleveland: The Chemical Rubber Co. Borgan, O., and K. Liestol. 1990. A note on confidence intervals and bands for the survivor function based on transformations. Scand J Stat 17:35–42. Box, G. E. P., and G. C. Tiao. 1973. Bayesian Inference in Statistical Analysis. Reading, MA: Addison-Wesley. Brown, L. D., T. T. Cai, and A. DasGupta. 2001. Interval estimation for a binomial proportion. Stat Sci 16(2):101–33. Burr, I. W. 1976. Statistical Quality Control Methods. New York: Marcel Dekker. Burr, I. W. 1979. Elementary Statistical Quality Control. New York: Marcel Dekker. Calvin, T. W. 1983. Quality control techniques for “zero defects”. IEEE Trans Compon Hybrid Manuf Technol 6:323–8. Carlin, B. P., and T. A. Louis. 1996. Bayes and Empirical Bayes Methods for Data Analysis. London: Chapman & Hall. Casella, G., and E. George. 1992. Explaining the gibbs sampler. Am Stat 46:167–74. Chace, E. F. 1976. Right-censored grouped life test data analysis assuming a two-parameter Weibull distribution function. Microelectron Reliab 15:497–9. Chambers, J. M. et al. 1983. Grouped Methods For Data Analysis. Monterey, CA: Wadsworth. Clopper, C. J., and E. S. Pearson. 1934. The use of confidence or fiducial limits illustrated in the case of the binomial. Biometrica 26:404. Cohen, A. C., and B. J. Whitten. 1988. Parameter Estimation in Reliability and Life Span Models. New York: Marcel Dekker. Cook, R. J., and J. F. Lawless. 2007. The Statistical Analysis of Recurrent Events. New York: Springer. 551


Cox, D. R. 1953. Some simple approximate tests for poisson variates. Biometrika 40:354–60. Cox, D. R. 1962. Renewal Theory. New York: John Wiley & Sons. Cox, D. R. 1972. Regression models and life tables (with discussion). J R Stat Soc B 34:187–220. Cox, D. R., and P. A. W. Lewis. 1966. The Statistical Analysis of Series of Events. New York: John Wiley & Sons. Cox, D. R., and D. Oakes. 1984. Analysis of Survival Data. London: Chapman & Hall. Cran, G. W. 1976. Graphical estimation methods for weibull distributions. Microelectron Reliab 15:47–62. Crow, L. H. 1974. Reliability analysis for complex repairable systems. In Reliability and Biometry, ed. F. Proschan and R. J. Serfling, 379–410. Philadelphia: SIAM. Crow, L. H. 1982. Confidence interval procedures for the Weibull process with applications to reliability growth. Technometrics 24(1):67–72. Crow, L. H. 1990. Evaluating the reliability of repairable systems. In Proceedings Annual Reliability and Maintainability Symposium, Los Angeles, CA, 275–9. Crow, L. H. 1993. Confidence intervals on the reliability of repairable systems. In Proceedings Annual Reliability and Maintainability Symposium, Atlanda, GA, 126–34. Crowder, M. J. et al. 1991. Statistical Analysis of Reliability Data. New York: Chapman & Hall. D’Agostino, R. B., and M. A. Stephens. ed. 1986. Goodness-of-fit techniques. R. B. D’Agostino and M. A. Stephens. New York: Marcel Dekker. David, H. A., and H. N. Nagaraja. 2003. Order Statistics. 3rd ed. Hoboken, NJ: John Wiley & Sons. Davis, D. J. 1952. An analysis of some failure data. J Am Stat Soc 47:113–50. De Le Mare, R. F. 1991. Testing for reliability improvement or deterioration in repairable systems. Qual Reliab Eng Int 8:123–32. Deming, W. E. 1982. Quality, Productivity, and Competitive Position. Cambridge, MA: Massachusetts Institute of Technology. Dodge, H. F., and H. G. Romig. 1959. Sampling Inspection Tables, Single and Double Sampling. 2nd ed. New York: John Wiley & Sons. Duane, J. T. 1964. Learning curve approach to reliability monitoring. IEEE Trans Aerosp 2:563–6. Duke, S. D., and W. Q. Meeker Jr. 1981. CENSOR—A user-oriented computer program for life data analysis. Am Stat 34:59–60. Duncan, A. J. 1986. Quality Control and Industrial Statistics. 5th ed. Homewood, IL: Richard D. Irwin. Efron, B. 1979. Bootstrap methods: Another look at the jackknife. Ann Stat 7:1–26. Efron, B. 1981. Censored data and the bootstrap. J Am Stat Assoc 76:312–9. Elsayed, E. A. 1996. Reliability Engineering. Reading, MA: Addison-Wesley Longman. Eyring, H., S. Glasstones, and K. J. Laidler. 1941. The Theory of Rate Processes. New York: McGraw-Hill. Fasser, Y., and D. Brettner. 1992. Process Improvement in the Electronics Industry. New York: John Wiley & Sons. Feller, W. 1968. An Introduction to Probability Theory and Its Applications. Vol. 1, 3rd ed. New York: John Wiley & Sons. Fisher, R. A. 1954. Statistical Methods for Research Workers. 12th ed. London: Oliver and Boyd. Gnedenko, B. V., Y. K. Belyayev, and A. D. Solovyev. 1969. Mathematical Methods of Reliability Theory. New York: Academic Press. Grant, E. L., and R. S. Leavenworth. 1988. Statistical Quality Control. 6th ed. New York: McGraw-Hill. Greenwood, M. 1926. The natural duration of cancer. Rep Public Health Med Subj 33:1–26. Guenther, W. C. 1974. Sample size formulas for some binomial problems. Technometrics 16:465–7. Gumbel, E. J. 1954. Statistical Theory of Extreme Values and Some Practical Applications. Washington, DC: National Bureau of Standards. Hahn, G. J. 1979. 
Minimum and near minimum sampling plans. J Qual Technol 11(4):206–12. Hahn, G. J., and S. S. Shapiro. 1967. Statistical Models in Engineering. New York: John Wiley & Sons. Hall, W. J., and J. A. Wellner. 1980. Confidence bands for a survival curve from censored data. Biometrika 67:133–43. Hamada, M. S., A. G. Wilson et al. 2008. Bayesian Reliability. New York: Springer. Harris, E. K., and A. Albert. 1991. Survivorship Analysis for Clinical Studies. New York: Marcel Dekker.


Hogg, R. V., and E. A. Tanis. 2001. Probability and Statistical Inference. 6th ed. Upper Saddle River, NJ: Prentice Hall. Hollander, M., I. W. McKeague, and J. Wang. 1997. Likelihood ratio-based confidence bands for survival functions. J Am Stat Assoc 92:215–26. Hyde, J. 1980. Survival analysis with incomplete observations. In Biostatistics Casebook, ed. R. G. Miller, B. Efron, B. W. Brown, and I. E. Moses, 31–46. New York: John Wiley & Sons. Ishikawa, K. 1982. Guide to Quality Control. Tokyo: Asian Productivity Organization. Jensen, F., and N. E. Petersen. 1982. Burn-In: An Engineering Approach to the Design and Analysis of Burn-in Procedures. New York: John Wiley & Sons. Johnson, L. G. 1951. The median ranks of sample values in their population with an application to certain fatigue studies. Ind Math 2:1–9. Johnson, L. G. 1964. Theory and Technique of Variation Research. Amsterdam: Elsevier. Juran, J. M., ed. 1988. Quality Control Handbook. 4th ed. New York: McGraw-Hill. Kalbfleisch, J. D., and R. L. Prentice. 2002. The Statistical Analysis of Failure Time Data. 2nd ed. Hoboken, NJ: John Wiley & Sons. Kane, V. E. 1986. Process capability indices. J Qual Technol 18(1):41–52. Kaplan, E. L., and P. Meier. 1958. Nonparametric estimation from incomplete observations. J Am Stat Assoc 53:457–81. Kapur, K. C., and L. R. Lamberson. 1977. Reliability in Engineering Design. New York: John Wiley & Sons. Kececioglu, D. 1991. Reliability Engineering Handbook. Vol. 2. Englewood Cliffs, NJ: Prentice Hall. Kendall, M. G. 1938. A new measure of rank correlation. Biometrika 30:81–93. Kielpinski, T. J., and W. Nelson. 1975. Optimum censored accelerated life-tests for the normal and lognormal life distributions. IEEE Trans Reliab R-24(5):310–20. Klein, J. P., and M. L. Moeschberger. 2003. Survival Analysis, Techniques for Censored and Truncated Data. 2nd ed. New York: Springer-Verlag. Kleinbaum, D. G., L. L. Kupper, K. E. Muller, and A. Nizam. 1998. Applied Regression Analysis and Multivariable Methods. 3rd ed. Pacific Grove, CA: Duxbury Press. Kolmogorov, A. N. 1941. On a logarithmic normal distribution law of the dimensions of particles under pulverization. Dokl Akad Nauk USSR 31(2):99–101. Kotz, S., and N. Johnson. 1993. Process Capability Indices. London: Chapman & Hall. Kunter, M. H., C. J. Nachtsheim, J. Neter, and W. Li. 2005. Applied Linear Statistical Models. 5th ed. Boston, MA: Mc-Graw-Hill/Irwin. Landzberg, A. H., and K. C. Norris. 1969. Reliability of controlled collapse interconnections. IBM J Res Dev 13:3. Lawless, J. F. 2003. Statistical Models and Methods for Lifetime Data. 2nd ed. New York: John Wiley & Sons. Lawless, J. F., and J. C. Nadeau. 1995. Some simple robust methods for the analysis of recurrent events. Technometrics 37:158–68. Lilliefors, H. 1969. On the kolmogorov-smirnov test for normality with mean and variance unknown. J Am Stat Assoc 64:387–9. Lindley, D. V. 1965. Introduction to Probability and Statistics from a Bayesian Viewpoint, Part I, Probability and Part 2, Inference. Cambridge, UK: Cambridge University Press. Mann, H. B. 1945. Nonparametric test against trend. Econometrica 13:245–59. Mann, N. R., R. E. Schafer, and N. D. Singpurwalla. 1974. Methods for Statistical Analysis of Reliability and Life Data. New York: John Wiley & Sons. Martz, H. F., and R. A. Waller. 1982. Bayesian Reliability Analysis. New York: John Wiley & Sons. Massey, F. J., Jr. 1951. The Kolmogorov-Smirnov Test for Goodness of Fit. Journal of the American Statistical Association 46(253):68–78. 
Meeker, W. Q., and L. A. Escobar. 1998. Statistical Methods for Reliability Data. New York: John Wiley & Sons. Meeker, W. Q., and W. Nelson. 1975. Optimum accelerated life-tests for the weibull and exteme value distributions. IEEE Trans Reliab R-24(5):321–32.


Mendenhall, W., R. L. Schaeffer, and D. D. Wackerly. 1990. Mathematical Statistics with Applications. Belmont, CA: Wadsworth. Michael, J. R., and W. R. Schucany. 1986. Analysis of data from censored samples. In Goodness of Fit Techniques, ed. R. B. D’Agostino and M. A. Stephens, 461-496. New York: Marcel Dekker. Middleton, M. R. 2004. Data Analysis Using Microsoft Excel. Belmont, CA: Brookes/Cole/Thomson Learning. MIL-HDBK-189. 1981. Reliability Growth Management. U. S. Government Printing Office: Washington, DC. MIL-HDBK-217F. 1986. Reliability Prediction of Electronic Equipment. U. S. Government Printing Office: Washington, DC. Miller, L. H. 1956. Table of percentage points of kolmogorov statistics. J Am Stat Assoc 51:111–21. MIL-M-38510J. 1991. General Specification for Military Specification Microcircuits, U. S. Government Printing Office. MIL-S-19500G. 1963. General Specification for Semiconductor Devices. U. S. Government Printing Office. MIL-STD-105E 1989. Sampling Procedures and Tables for Inspection by Attributes. U. S. Government Printing Office: Washington, DC. MIL-STD-1629A. 1980. Procedures For Performing a Failure Mode, Effects and Criticality Analysis. U. S. Government Printing Office: Washington, DC. MIL-STD-756B. 1963. Reliability Modeling & Prediction. U. S. Government Printing Office: Washington, DC. MIL-STD-781C. 1977. Reliability Design Qualification and Production Tests: Exponential Distribution. U. S. Government Printing Office: Washington, DC. MIL-STD-1635 (EC). 1978. Reliability Growth Testing. U. S. Government Printing Office: Washington, DC. Montgomery, D. C. 2009. Introduction to Statistical Quality Control. 6th ed. New York: John Wiley & Sons. Musa, J. D., A. Iannino, and K. Okumoto. 1987. Software Reliability. New York: McGraw-Hill. Nair, V. N. 1984. Confidence bands for survival functions with censored data: A comparative study. Technometrics 26:265–75. Nelson, W. B. 1969. Hazard plotting for incomplete failure data. J Qual Technol 1:27–52. Nelson, W. B. 1972. Theory and application of hazard plotting for censored failure data. Technometrics 14:945–6. Nelson, W. B. 1975. Graphical analysis of accelerated life test data with a mix of failure modes. IEEE Trans Reliab R-24(4):230–7. Nelson, W. B. 1982. Applied Life Data Analysis. New York: John Wiley & Sons. Nelson, W. B. 1986. Graphical analysis of failure data from repairable systems. In General Electric Co. Corp. Research & Development TIS Report 86CRD114, Schenectady, NY. Nelson, W. B. 1988a. Analysis of repair data. In Proceedings Annual Reliability and Maintainability Symposium, Los Angeles, CA, 231–3. Nelson, W. B. 1988b. Graphical analysis of system repair data. J Qual Technol 20(1):24–35. Nelson, W. B. 1990. Accelerated Testing. New York: John Wiley & Sons. Nelson, W. B. 1995. Confidence limits for recurrence data—applied to cost or number of product repairs. Technometrics 37:147–57. Nelson, W. B. 2003. Recurrent Events Data Analysis for Product Repairs, Disease Recurrences, and Other Applications. Philadelphia, PA: SIAM. Nelson, W. B. 2004. Accelerated Testing: Statistical Models, Test Plans, and Data Analysis. Hoboken, NJ: John Wiley & Sons. Nelson, W. B. 2005. A bibliography of accelerated test plans. IEEE Trans Reliab 54(2):194–7 and Part II—References, 54(3):370–3. Nelson, W. B., and N. Doganaksoy. 1989. A computer program for an estimate and confidence limits for the mean cumulative function for cost or number of repairs of repairable products. In General Electric Co. Corp. 
Research & Development TIS Report 89CRD239, Schenectady, NY. Neter, J., W. Wasserman, M. H. Kutner, and C. Nachtshein. 1996. Applied Linear Statistical Models. 4th ed. Columbus, OH: McGraw-Hill/Irwin. Ntzoufras, I. 2009. Bayesian Modeling Using WinBUGS. Hoboken, NJ: Wiley Series in Computational Statistics.

