QUALITY BASICS
Root Cause Analysis For Beginners
by James J. Rooney and Lee N. Vanden Heuvel
Quality Progress, July 2004

In 50 Words Or Less
• Root cause analysis helps identify what, how and why something happened, thus preventing recurrence.
• Root causes are underlying, are reasonably identifiable, can be controlled by management and allow for generation of recommendations.
• The process involves data collection, cause charting, root cause identification and recommendation generation and implementation.

Root cause analysis (RCA) is a process designed for use in investigating and categorizing the root causes of events with safety, health, environmental, quality, reliability and production impacts. The term “event” is used to generically identify occurrences that produce or have the potential to produce these types of consequences.

Simply stated, RCA is a tool designed to help identify not only what and how an event occurred, but also why it happened. Only when investigators are able to determine why an event or failure occurred will they be able to specify workable corrective measures that prevent future events of the type observed.

Understanding why an event occurred is the key to developing effective recommendations. Imagine an occurrence during which an operator is instructed to close valve A; instead, the operator closes valve B. The typical investigation would probably conclude operator error was the cause.

This is an accurate description of what happened and how it happened. However, if the analysts stop here, they have not probed deeply enough to understand the reasons for the mistake. Therefore, they do not know what to do to prevent it from occurring again.

In the case of the operator who turned the wrong valve, we are likely to see recommendations such as retrain the operator on the procedure, remind all operators to be alert when manipulating valves or emphasize to all personnel that careful attention to the job should be maintained at all times. Such recommendations do little to prevent future occurrences.

Generally, mistakes do not just happen but can be traced to some well-defined causes. In the case of the valve error, we might ask, “Was the procedure confusing? Were the valves clearly labeled? Was the operator familiar with this particular task?”

The answers to these and other questions will help determine why the error took place and what the organization can do to prevent recurrence. In the case of the valve error, example recommendations might include revising the procedure or performing procedure validation to ensure references to valves match the valve labels found in the field.

Identifying root causes is the key to preventing similar recurrences. An added benefit of an effective RCA is that, over time, the root causes identified across the population of occurrences can be used to target major opportunities for improvement. If, for example, a significant number of analyses point to procurement inadequacies, then resources can be focused on improvement of this management system. Trending of root causes allows development of systematic improvements and assessment of the impact of corrective programs.
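The trending idea can be sketched in a few lines of Python. This is a minimal illustration rather than part of the authors' method: the event identifiers and category names are hypothetical, and a real program would draw them from completed analyses.

```python
from collections import Counter

# Hypothetical results of past analyses: event id -> root cause categories found.
# Both the ids and the category names are illustrative assumptions.
completed_analyses = {
    "event-01": ["procurement control", "training"],
    "event-02": ["procurement control"],
    "event-03": ["procedures", "procurement control"],
    "event-04": ["training"],
}


def trend_root_causes(analyses):
    """Return (category, number of analyses citing it), most frequent first."""
    counts = Counter()
    for categories in analyses.values():
        # A set ensures an analysis that cites a category twice counts once.
        counts.update(set(categories))
    return counts.most_common()


if __name__ == "__main__":
    total = len(completed_analyses)
    for category, n in trend_root_causes(completed_analyses):
        print(f"{category}: cited in {n} of {total} analyses")
```

Counting analyses rather than raw mentions keeps one unusually detailed investigation from dominating the trend; that choice is an assumption of this sketch.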
Definition
Although there is substantial debate on the definition of root cause, we use the following:
1. Root causes are specific underlying causes.
2. Root causes are those that can reasonably be identified.
3. Root causes are those management has control to fix.
4. Root causes are those for which effective recommendations for preventing recurrences can be generated.

Root causes are underlying causes. The investigator’s goal should be to identify specific underlying causes. The more specific the investigator can be about why an event occurred, the easier it will be to arrive at recommendations that will prevent recurrence.

Root causes are those that can reasonably be identified. Occurrence investigations must be cost beneficial. It is not practical to keep valuable manpower occupied indefinitely searching for the root causes of occurrences. Structured RCA helps analysts get the most out of the time they have invested in the investigation.

Root causes are those over which management has control. Analysts should avoid using general cause classifications such as operator error, equipment failure or external factor. Such causes are not specific enough to allow management to make effective changes. Management needs to know exactly why a failure occurred before action can be taken to prevent recurrence.

We must also identify a root cause that management can influence. Identifying “severe weather” as the root cause of parts not being delivered on time to customers is not appropriate. Severe weather is not controlled by management.

Root causes are those for which effective recommendations can be generated. Recommendations should directly address the root causes identified during the investigation. If the analysts arrive at vague recommendations such as, “Improve adherence to written policies and procedures,” then they probably have not found a basic and specific enough cause and need to expend more effort in the analysis process.
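The four-part definition can be treated as a checklist applied to each candidate cause. The sketch below is one hypothetical way to record that check; the class, field names and the yes/no judgments are assumptions made for illustration, and the judgments themselves remain the analyst's call.

```python
from dataclasses import dataclass


@dataclass
class CandidateCause:
    """A proposed root cause, scored against the four-part definition."""
    description: str
    specific_and_underlying: bool
    reasonably_identifiable: bool
    management_can_fix: bool
    supports_recommendations: bool

    def failed_criteria(self):
        """List the parts of the definition this candidate does not meet."""
        checks = {
            "specific underlying cause": self.specific_and_underlying,
            "reasonably identifiable": self.reasonably_identifiable,
            "within management control": self.management_can_fix,
            "supports effective recommendations": self.supports_recommendations,
        }
        return [name for name, met in checks.items() if not met]

    def is_root_cause(self):
        """True only when all four criteria are met."""
        return not self.failed_criteria()


# Illustrative judgments for three candidates discussed in the text.
weather = CandidateCause("Severe weather delayed deliveries", True, True, False, False)
operator_error = CandidateCause("Operator error", False, True, True, False)
labels = CandidateCause(
    "Procedure references do not match valve labels in the field",
    True, True, True, True,
)

if __name__ == "__main__":
    for cause in (weather, operator_error, labels):
        print(cause.description, "->", cause.failed_criteria() or "meets all four criteria")
```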
Four Major Steps
RCA is a four-step process involving the following:
1. Data collection.
2. Causal factor charting.
3. Root cause identification.
4. Recommendation generation and implementation.

Step one: data collection. The first step in the analysis is to gather data. Without complete information and an understanding of the event, the causal factors and root causes associated with the event cannot be identified. The majority of time spent analyzing an event is spent in gathering data.

Step two: causal factor charting. Causal factor charting provides a structure for investigators to organize and analyze the information gathered during the investigation and identify gaps and deficiencies in knowledge as the investigation progresses. The causal factor chart is simply a sequence diagram with logic tests that describes the events leading up to an occurrence, plus the conditions surrounding these events (see Figure 1).

FIGURE 1. Causal Factor Chart (parts one and two). The chart traces the example event from “Mary begins frying chicken” (5:00 pm) through “Smoke detector alarms” (about 5:10 pm) to the loss events “Kitchen destroyed by fire” and “Other losses from smoke and water damage?” Event boxes are annotated with their data sources (for example, Mary, Jane, FD dispatcher, inspection tag, observation) and with open questions that drive further data collection (for example, “Does Mary know how to use a fire extinguisher?” and “How long did it take for the FD to arrive?”). Causal factors are marked CF.

Preparation of the causal factor chart should begin as soon as investigators start to collect information about the occurrence. They begin with a skeleton chart that is modified as more relevant facts are uncovered. The causal factor chart should drive the data collection process by identifying data needs.

Data collection continues until the investigators are satisfied with the thoroughness of the chart (and hence are satisfied with the thoroughness of the investigation). When the entire occurrence has been charted out, the investigators are in a good position to identify the major contributors to the incident, called causal factors. Causal factors are those contributors (human errors and component failures) that, if eliminated, would have either prevented the occurrence or reduced its severity.

In many traditional analyses, the most visible causal factor is given all the attention. Rarely, however, is there just one causal factor; events are usually the result of a combination of contributors. When only one obvious causal factor is addressed, the list of recommendations will likely not be complete. Consequently, the occurrence may repeat itself because the organization did not learn all that it could from the event.

Step three: root cause identification. After all the causal factors have been identified, the investigators begin root cause identification. This step involves the use of a decision diagram called the Root Cause Map (see Figure 2) to identify the underlying reason or reasons for each causal factor. The map structures the reasoning process of the investigators by helping them answer questions about why particular causal factors exist or occurred. The identification of root causes helps the investigator determine the reasons the event occurred so the problems surrounding the occurrence can be addressed.

Step four: recommendation generation and implementation. The next step is the generation of recommendations. Following identification of the root causes for a particular causal factor, achievable recommendations for preventing its recurrence are then generated.

The root cause analyst is often not responsible for the implementation of recommendations generated by the analysis. However, if the recommendations are not implemented, the effort expended in performing the analysis is wasted. In addition, the events that triggered the analysis should be expected to recur. Organizations need to ensure that recommendations are tracked to completion.
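Walking a causal factor through a decision diagram such as the Root Cause Map can be pictured as following a path through a tree. The sketch below encodes a small excerpt as nested dictionaries and checks that a recorded path is valid. The node names come from the article's Figure 2 and Table 1; the nested-dict encoding and the function name are assumptions made for illustration, not how the map is actually distributed.

```python
# A small excerpt of a Root Cause Map-style decision tree as nested dicts.
# LTA = less than adequate. Only two branches are shown.
ROOT_CAUSE_MAP_EXCERPT = {
    "Equipment difficulty": {
        "Equipment reliability program problem": {
            "Equipment reliability program design LTA": {
                "No program": {},
                "Program LTA": {},
            },
        },
    },
    "Personnel difficulty": {
        "Company employee": {
            "Training": {
                "Training LTA": {
                    "Abnormal events/emergency training LTA": {},
                },
            },
        },
    },
}


def first_invalid_step(tree, path):
    """Walk the path from the top of the tree.

    Returns None if every step is a child of the previous node,
    otherwise the first step that cannot be found.
    """
    node = tree
    for step in path:
        if step not in node:
            return step
        node = node[step]
    return None


burner_path = [
    "Equipment difficulty",
    "Equipment reliability program problem",
    "Equipment reliability program design LTA",
    "No program",
]
mixed_up_path = ["Equipment difficulty", "Training"]

if __name__ == "__main__":
    print(first_invalid_step(ROOT_CAUSE_MAP_EXCERPT, burner_path))
    print(first_invalid_step(ROOT_CAUSE_MAP_EXCERPT, mixed_up_path))
```

A check of this kind only confirms that a path exists in the map; deciding which branch applies to a causal factor still depends on the evidence gathered.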
Presentation of Results
Root cause summary tables (see Table 1) can organize the information compiled during data analysis, root cause identification and recommendation generation. Each column represents a major aspect of the RCA process.
• In the first column, a general description of the causal factor is presented along with sufficient background information for the reader to be able to understand the need to address this causal factor.
• The second column shows the path or paths through the Root Cause Map associated with the causal factor.
• The third column presents recommendations to address each of the root causes identified.

Use of this three-column format aids the investigator in ensuring root causes and recommendations are developed for each causal factor.

The end result of an RCA investigation is generally an investigation report. The format of the report is usually well defined by the administrative documents governing the particular reporting system, but the completed causal factor chart and causal factor summary tables provide most of the information required by most reporting systems.
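The three-column format lends itself to a simple completeness check: every causal factor should have at least one path and at least one recommendation. The following sketch shows one hypothetical way to hold the rows and flag gaps. The class and function names are assumptions; the first row uses wording from Table 1, and the second row is deliberately left unfinished to show what the check reports.

```python
from dataclasses import dataclass, field


@dataclass
class SummaryRow:
    """One row of a three-column root cause summary table."""
    causal_factor: str                                    # column 1
    paths: list = field(default_factory=list)             # column 2
    recommendations: list = field(default_factory=list)   # column 3


def incomplete_rows(rows):
    """Return the causal factors that lack a path or a recommendation."""
    return [
        row.causal_factor
        for row in rows
        if not row.paths or not row.recommendations
    ]


rows = [
    SummaryRow(
        "Mary leaves the frying chicken unattended.",
        paths=[[
            "Personnel difficulty",
            "Administrative/management systems",
            "SPACs LTA",
            "No SPACs",
        ]],
        recommendations=[
            "Implement a policy that hot oil is never left unattended on the stove.",
        ],
    ),
    # Deliberately unfinished row, included only to exercise the check.
    SummaryRow("Mary throws water on fire."),
]

if __name__ == "__main__":
    for description in incomplete_rows(rows):
        print("Needs a path and/or recommendation:", description)
```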
Example Problem
The following example is nontechnical, allowing the reader to focus on the analysis process and not the technical aspects of the situation. The following narrative is the account of the event according to Mary:

It was 5 p.m. I was frying chicken. My friend Jane stopped by on her way home from the doctor, and she was very upset. I invited her into the living room so we could talk. After about 10 minutes, the smoke detector near the kitchen came on. I ran into the kitchen and found a fire on the stove. I reached for the fire extinguisher and pulled the plug. Nothing happened. The fire extinguisher was not charged. In desperation, I threw water on the fire. The fire spread throughout the kitchen. I called the fire department, but the kitchen was destroyed. The fire department arrived in time to save the rest of the house.
Data gathering began as soon as possible after the event to prevent loss or alteration of the data. The RCA team toured the area as soon as the fire department declared it safe. Because data from people are the most fragile, Mary, Jane and the firefighters were interviewed immediately after the fire. Photographs were taken to record physical and position data.

The analysts then developed the causal factor chart (see Figure 1) to clearly define the sequence of events that led to the fire. The causal factor chart begins with the event: Mary begins frying chicken at 5 p.m. As the chart develops from left to right, the sequences begin to unfold. The loss events (kitchen destroyed by fire and other losses from smoke and water damage) are the shaded rectangles in the causal factor chart.

Although we read the chart from left to right, it is developed from right to left (backwards). Development always starts at the end because that is always a known fact. Logic and time tests are used to build the chart back to the beginning of the event. Numerous questions are usually generated that identify additional necessary data.

After the causal factor chart was complete (additional data were gathered to answer the questions shown in Figure 1), the analysts identified the factors that influenced the course of events. There are four causal factors for this event (see Table 1). Elimination of these causal factors would have either prevented the occurrence or reduced its severity. Note the recommendations in Table 1 are written as if Mary’s house were an industrial facility.

Notice that causal factor two may be unexpected. It wasn’t overheating of the oil or splattering of the oil that ignited the fire. If the wrong causal factor is identified, the wrong corrective actions will be developed. The application of the technique identified that the electric burner element failed by shorting out. The short melted Mary’s aluminum pan, releasing the oil onto the hot burner, starting the fire.

The analyst must be willing to probe the data first to determine what happened during the occurrence, second to describe how it happened, and third to understand why.

FIGURE 2. Root Cause Map (sections one and two)
Start here with each causal factor. The map branches from a primary difficulty source to a problem category, then to a root cause category, a near root cause and finally a root cause. LTA = less than adequate.

Primary difficulty sources and their problem categories:
• Equipment difficulty: equipment design problem; equipment reliability program problem; installation/fabrication; equipment misuse.
• Personnel difficulty: company employee; contract employee.
• Other difficulty: natural phenomena; sabotage/horseplay; external events; other.

Root cause categories and their near root causes:
• Design input/output: design input LTA; design output LTA.
• Equipment records: equipment design records LTA; equipment operating/maintenance history LTA.
• Equipment reliability program design LTA: no program; program LTA.
• Equipment reliability program implementation LTA: corrective maintenance LTA; preventive maintenance LTA; predictive maintenance LTA; proactive maintenance LTA; failure finding maintenance LTA; routine equipment rounds LTA.
• Administrative/management systems: standards, policies or administrative controls (SPACs) LTA; SPACs not used; safety/hazard/risk review; problem identification control; product/material control; procurement control; document and configuration control; customer interface/services.
• Procedures: not used; misleading/confusing; wrong/incomplete.
• Human factors engineering: workplace layout; work environment; workload; intolerant system.
• Training: no training; training records system LTA; training LTA.
• Immediate supervision: preparation; supervision during work.
• Communications: no communication or not timely; misunderstood communication; wrong instructions; job turnover LTA.
• Personal performance: problem detection LTA; sensory/perceptual capabilities LTA*; reasoning capabilities LTA*; motor/physical capabilities LTA*; attitude/attention LTA*; rest/sleep LTA (fatigue)*; personal/medication problems*.

Each near root cause branches into specific root causes. For example, SPACs LTA leads to: no SPACs; not strict enough; confusing, contradictory or incomplete; technical error; responsibility for item/activity not adequately defined; planning, scheduling or tracking of work activities LTA; rewards/incentives LTA; employee screening/hiring LTA.

Note: Node numbers on the map correspond to matching page in Appendix A of the Root Cause Analysis Handbook.
*Note: These nodes are for descriptive purposes only.
© 1995, 1997, 1999, 2000 and 2001, ABSG Consulting Inc.

TABLE 1. Root Cause Summary Table
Event description: Kitchen is destroyed by fire and damaged by smoke and water.
Event #: 2003-1

Causal factor #1
Description: Mary leaves the frying chicken unattended.
Paths through Root Cause Map:
• Personnel difficulty.
• Administrative/management systems.
• Standards, policies or administrative controls (SPACs) less than adequate (LTA).
• No SPACs.
Recommendations:
• Implement a policy that hot oil is never left unattended on the stove.
• Determine whether policies should be developed for other types of hazards in the facility to ensure they are not left unattended.
• Modify the risk assessment process or procedure development process to address requirements for personnel attendance during process operations.

Causal factor #2
Description: Electric burner element fails (shorts out).
Paths through Root Cause Map:
• Equipment difficulty.
• Equipment reliability program problem.
• Equipment reliability program design LTA.
• No program.
Recommendations:
• Replace all burners on stove.
• Develop a preventive maintenance strategy to periodically replace the burner elements.
• Consider alternative methods for preparing chicken that may involve fewer hazards, such as baking the chicken or purchasing the finished product from a supplier.

Causal factor #3
Description: Fire extinguisher does not operate when Mary tries to use it.
Paths through Root Cause Map (first path):
• Equipment difficulty.
• Equipment reliability program problem.
• Equipment proactive maintenance LTA.
• Activity implementation LTA.
Recommendations:
• Refill the fire extinguisher.
• Inspect other fire extinguishers in the facility to ensure they are full.
• Have incident reports describing the use of fire protection equipment routed to maintenance to trigger refilling of the fire extinguishers.
Paths through Root Cause Map (second path):
• Equipment difficulty.
• Equipment reliability program problem.
• Administrative/management systems.
• Problem identification and control LTA.
Recommendations:
• Add this fire extinguisher to the audit list.
• Verify that all fire extinguishers are on the quarterly fire extinguisher audit list.
• Have all maintenance work requests that involve fire protection equipment routed to the safety engineer so the quarterly checklists can be modified as required.

Causal factor #4
Description: Mary throws water on fire.
Paths through Root Cause Map:
• Personnel difficulty.
• Company employee.
• Training.
• Training LTA.
• Abnormal events/emergency training LTA.
Recommendations:
• Provide practical (hands-on) training on the use of fire extinguishers. Classroom training may be insufficient to adequately learn this skill.
• Review other skill based activities to ensure appropriate level of hands-on training is provided.
• Review the training development process to ensure adequate guidance is provided for determining the proper training setting (for example, classroom, lab, simulator, on the job training, computer based training).

Root Cause Map is a trademark of ABSG Consulting.
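The example's causal factor chart can be sketched as an ordered list of events with the causal factors flagged. This is a simplified, hypothetical encoding: the event wording comes from Figure 1, the left-to-right ordering is inferred from Mary's account, and conditions, data sources and open questions are omitted.

```python
from dataclasses import dataclass


@dataclass
class ChartEvent:
    """One event box on a simplified causal factor chart."""
    description: str
    causal_factor: bool = False  # marked CF on the chart
    loss_event: bool = False     # shaded rectangle on the chart


# Listed left to right, the order in which the chart is read.
chart = [
    ChartEvent("Mary begins frying chicken"),
    ChartEvent("Mary leaves the frying chicken unattended", causal_factor=True),
    ChartEvent("Electric burner shorts out", causal_factor=True),
    ChartEvent("Fire starts on the stove"),
    ChartEvent("Fire extinguisher does not operate when Mary tries to use it",
               causal_factor=True),
    ChartEvent("Mary throws water on the fire", causal_factor=True),
    ChartEvent("Fire spreads throughout the kitchen"),
    ChartEvent("Kitchen destroyed by fire", loss_event=True),
]


def causal_factors(events):
    """Return the descriptions of events flagged as causal factors."""
    return [event.description for event in events if event.causal_factor]


def development_order(events):
    """Return events right to left, the order in which the chart is built."""
    return list(reversed(events))


if __name__ == "__main__":
    print("Start charting from:", development_order(chart)[0].description)
    for number, description in enumerate(causal_factors(chart), start=1):
        print(f"Causal factor #{number}: {description}")
```

Each flagged event would then be taken through the Root Cause Map in turn, which is why missing a causal factor at this stage leaves a gap in the recommendations.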
JAMES J. ROONEY is a senior risk and reliability engineer with ABSG Consulting Inc.’s Risk Consulting Division in Knoxville, TN. He earned a master’s degree in nuclear engineering from the University of Tennessee. Rooney is a Fellow of ASQ and an ASQ certified quality auditor, quality auditor-hazard analysis and critical control points, quality engineer, quality improvement associate, quality manager and reliability engineer.

LEE N. VANDEN HEUVEL is a senior risk and reliability engineer with ABSG Consulting Inc.’s Risk Consulting Division in Knoxville, TN. He earned a master’s degree in nuclear engineering from the University of Wisconsin. Vanden Heuvel co-authored the Root Cause Analysis Handbook: A Guide to Effective Incident Investigation, co-developed the RootCause Leader software and was a co-author of the Center for Chemical Process Safety’s Guidelines for Investigating Chemical Process Incidents. He develops and teaches courses on the subject.

BIBLIOGRAPHY
Accident/Incident Investigation Manual, second edition, DOE/SSDC 76-45/27, Department of Energy.
Events and Causal Factors Charting, DOE/SSDC 76-45/14, Department of Energy, 1985.
Ferry, Ted S., Modern Accident Investigation and Analysis, second edition, John Wiley and Sons, 1988.
Guidelines for Investigating Chemical Process Incidents, American Institute of Chemical Engineers, Center for Chemical Process Safety, 1992.
Occupational Safety and Health Administration Accident Investigation Course, Office of Training and Education, 1993.
Root Cause Analysis Handbook, WSRC-IM-91-3, Department of Energy, 1991 (and earlier versions).
Root Cause Analysis Handbook: A Guide to Effective Incident Investigation, ABSG Consulting Inc., 1999.
User’s Guide for Reactor Incident Root Cause Coding Tree, revision five, DPST-87-209, E.I. duPont de Nemours, Savannah River Laboratory, 1986.