Root Cause Analysis
When something goes wrong at your facility, it can be tempting to direct your efforts at quickly fixing the problem. But often, acting too quickly can fix the symptoms of failure without addressing the root cause, leaving the probability of future failures high. Root cause analysis can help future-proof your facility by investigating why failures happen, leading to changes in procedures, processes, or design that can prevent similar failures from occurring unexpectedly.
Root cause analysis (RCA) is an essential tool in the reliability toolbox. Root cause analysis allows facilities to prevent future instances of failure by tracing the root cause of events with safety, health, environmental, reliability or production impacts, rather than simply correcting the proximate or immediate cause of the failure. RCA shares many features with Failure Modes and Effects Analysis (FMEA), with one major difference: an FMEA is performed prior to failure events using experience, facility and asset history, manufacturer manuals, and operating context to determine dominant causes of failures and develop mitigating strategies in advance. It is a proactive strategy intended to reduce the incidence of failures before they happen.
Root cause analysis is primarily a reactive strategy that addresses failure causes after they have happened. Not all failures will be caught by an FMEA since new assets, new technology, or process changes may limit the available information about dominant failure modes. When failures do occur, it is important to recognize not just the immediate failure, but the ultimate origin or cause of that failure. Once the root cause has been determined, this information can be used to update an FMEA and make changes to processes, procedures, or facility design to prevent future failure events and improve the reliability of your facility.
Although RCA is considered a reactive strategy, it should be differentiated from simple reactive responses. Determining the root cause of a failure requires a knowledgeable team, objective evidence, investigation of a variety of potential causes, careful analysis of the facts, the commitment of management and staff, and insight born of experience with RCA. Quickly done RCA is often poorly done RCA that leaves out crucial information, or rushes to judgement based on assumptions, rather than objective facts.
The steps involved in RCA may seem simple, but they require expertise and experience to arrive at effective solutions. General guidance for root cause analysis and how it can help your facility are detailed below.
Root Cause Analysis Steps
Contain the Failure
When a failure occurs, the first step should always be containment to prevent the failure from causing further risk to health, safety, environment or process integrity. But containment should be performed with an eye to preserving the site of the failure for future analysis. For example, before loosening bolts, staff should verify that they were correctly tightened when the failure occurred. Acting too quickly to contain the failure can lead to muddying the waters and make RCA more complicated and time-consuming. Where possible, photographs of the site should be taken immediately and during the process of containment to verify the condition of assets, asset settings, and other potentially useful site information. Seemingly inconsequential data can be lost to enthusiastic containment measures. Having a corrective action procedure in place to guide staff through data preservation during containment can prevent the loss of information during this time. It is useful to begin collecting information about the failure event as early as possible in the containment process, including gathering eyewitness statements before memories become clouded by activities taken to contain the failure.
Once the failure has been contained and risks to health, safety, and environment have been addressed, the process of gathering further documentation can proceed. Staff involved in the failure, as witnesses or participants in the containment, should be interviewed and written statements should be gathered within 24 hours of the failure event. In general, these statements should answer basic questions about who was involved, what happened, when and where the failure occurred, and what operating conditions preceded the failure (e.g. temperature, pressure, lighting, and working conditions that might have contributed to it). In the absence of gross negligence, staff should be assured that their statements will not be held against them. This assurance helps ensure that statements contain the most accurate information and reduces the tendency for personnel to obscure events out of concern for their jobs.
In addition to personnel statements, all physical documentation should be gathered and protected from alteration. Depending on the type of failure, this documentation might include training documents and records to ensure personnel received adequate training, standard operating procedures to ensure that all steps were adequately detailed, asset logs and monitoring records, maintenance and inspection records, etc. In general, it is better to err on the side of too much rather than too little data collection. Apparently insignificant records, such as visitor logs or email messages, may indicate problems that may not immediately be recognized. Information gathering will continue to occur over the course of the RCA when additional information is required, but this early stage of information gathering is essential to starting the RCA on the right track.
Build the RCA Team
Once all relevant data has been collected, it is time to develop a team to analyze the data. Where possible, a person familiar with RCA should be selected to lead the team. RCA facilitation is more complicated than it may appear. The value of the process requires open and honest communication, and a willingness to listen to the viewpoints of others. A good facilitator should be able to guide the team in brainstorming and problem-solving, and be able to deal with potential conflicts that may erupt when team members disagree. In addition, a good facilitator will have an understanding of the assets and process under investigation to help narrow options and drill down deep enough to determine root causes.
Solid facilitation of root cause analysis helps keep the team on track and can provide more comprehensive results. In general, the team should be made up of technical personnel who understand the assets under investigation, process personnel who work with the assets and know both the written and unwritten procedures for managing the assets, technological personnel who understand the control strategies and process logic of monitoring and control equipment, and additional members as needed depending on the type of failure. Asset manufacturer representatives may be useful where asset malfunction is suspected. It may also be useful to have a team member assigned to communications to ensure that any additional data or follow up can be collected in a timely manner.
Once data has been collected, and the team has been assembled, the process of classifying the data and collecting any additional information can proceed. During this process, the team will determine what information is known, what data is required, what data needs to be verified, and what data is primarily speculation. Unknown, uncollected, or unverified data may require additional collection, testing, or logical assessment to be useful in the analysis. For example, personnel statements about the failure event may require additional verification to separate opinion from fact.
One method for keeping track of what information is verified and what requires additional testing or validation is the use of a KNOT chart. KNOT is an acronym for Know, Need to know, Opinion, and Think we know. Information that is regarded as credible and complete is placed in the Know column. Information that requires additional data collection, verification or testing is placed in the Need to know, Opinion, or Think we know columns and then actions are assigned based on filling the gaps necessary to move this information into the Know column. The RCA should make use of information that has been verified rather than assumptions.
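As a rough illustration of how a KNOT chart could be tracked, here is a minimal Python sketch. The class names, helper functions, and the pump-related entries are all hypothetical and are not part of any standard KNOT tooling. The sketch assumes each item carries one KNOT status and an optional follow-up action.

```python
from dataclasses import dataclass
from enum import Enum


class Knot(Enum):
    KNOW = "Know"
    NEED_TO_KNOW = "Need to know"
    OPINION = "Opinion"
    THINK_WE_KNOW = "Think we know"


@dataclass
class Item:
    statement: str
    status: Knot
    action: str = ""  # follow-up needed to move the item into Know


def usable_for_analysis(items):
    """Return only verified items; the RCA should rest on these."""
    return [i for i in items if i.status is Knot.KNOW]


def open_actions(items):
    """Return (statement, action) pairs for everything not yet verified."""
    return [(i.statement, i.action) for i in items if i.status is not Knot.KNOW]


# Hypothetical entries for a pump failure investigation.
chart = [
    Item("Pump tripped at 02:14 per historian log", Knot.KNOW),
    Item("Operator says bearing was noisy all week", Knot.OPINION,
         "Check vibration monitoring records"),
    Item("Lubricant grade used at last service", Knot.NEED_TO_KNOW,
         "Pull maintenance work order"),
]

# Once an item has been verified, it moves into the Know column.
chart[2].status = Knot.KNOW
chart[2].action = ""

verified = usable_for_analysis(chart)
pending = open_actions(chart)
```

In this sketch, the list of pending actions doubles as the assignment list for the team member responsible for follow-up.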
Construct a Failure Timeline
In order to track back to the root cause, a clear timeline of events should be established to understand all potentially causative events that led to the failure. Available monitoring data or telemetry should correspond to the timeline. The timeline should be sufficiently detailed to provide a concrete starting point. If one goes back too far, causes will become increasingly abstract. Knowing that an operator overslept and was late for work may have relevance in indicating the operator’s state of mind or level of distraction, which may have led to missing key information that indicated a failure was likely. Knowing that some 40 years ago the operator was born is unlikely to impact the investigation. Choosing an appropriate point before additional information becomes too abstract or irrelevant is one of the roles of the facilitator. Tracking back to the best available start point may require some adjustment over the course of the investigation, as apparently irrelevant information may become relevant depending on the facts of the case.
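The idea of choosing a start point and ordering events can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the event records, the seven-day lookback window, and the function name are hypothetical, and a real investigation would adjust the window as the facts develop.

```python
from datetime import datetime, timedelta


def build_timeline(events, failure_time, lookback):
    """Keep events inside the lookback window and sort them oldest first."""
    start = failure_time - lookback
    in_window = [e for e in events if start <= e[0] <= failure_time]
    return sorted(in_window, key=lambda e: e[0])


# Hypothetical events: (timestamp, source, description).
failure = datetime(2024, 3, 5, 2, 14)
events = [
    (datetime(2024, 3, 5, 2, 14), "telemetry", "Pump trips on high vibration"),
    (datetime(2024, 3, 4, 22, 5), "statement", "Operator arrives late for shift"),
    (datetime(2024, 3, 5, 1, 40), "telemetry", "Bearing temperature alarm"),
    (datetime(1984, 6, 1, 0, 0), "statement", "Operator is born"),
]

# The lookback window is the facilitator's judgment call; 7 days is an assumption.
timeline = build_timeline(events, failure, lookback=timedelta(days=7))
for when, source, what in timeline:
    print(when.isoformat(sep=" ", timespec="minutes"), source, what)
```

Merging statements and telemetry into one ordered list makes it easier to check that witness accounts correspond to the monitoring data.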
Map the Process
Process mapping can help provide a clear reference for the team as they track back to the root cause. Understanding the relationship between upstream and downstream elements of the process may help the RCA team identify failure causes further back than the initial failure would indicate. Knowing the relationship between the various process elements, and how a minor change early in the process might lead to a cascade effect that is only recognized late in the process is critical to understanding the root cause.
It is important that the process map identify changes that may have recently been made to the process or to control processes. Over time, changes made to process control strategies may have deviated from the design intent. Understanding both the design intent of the facility and the current or as-is condition can impact the root cause analysis. If current conditions have not been tracked or documented, or if operations staff have deviated from written standard operating procedures to respond to changing conditions, these changes should be documented and addressed during the RCA. Any changes initiated prior to the failure event should be given particular scrutiny.
Root Cause Corrective Actions
Once the team has been gathered, data has been collected and classified, further testing and verification have occurred, a timeline has been sketched out, and the team has full process information and current conditions available to them, the full RCA and corrective action analysis can begin. A root cause corrective action (RCCA) plan can be set based on the results of the analysis.
There are several tools that can be used to facilitate an RCCA. Each tool has its benefits, and a variety of tools can be brought into the analysis if one is found wanting. The facilitator should be able to transfer information from one tool to another, depending on the needs of the team. Two of the most common and reliable tools for facilitating root cause analysis are briefly described below.
The Ishikawa fishbone diagram is exceptionally useful for turning a brainstorming session into a logical map of potential failure causes. The fishbone diagram was developed in the manufacturing industry to improve quality control, but can be used in any process industry. Depending on the type of failure under investigation, it may be useful to use generic headings to identify key failure points. The most common are People, Methods, Machines, Materials, Measurement, and Environment.
These categories can be narrowed as the team moves beyond proximate causes to home in on the root cause. Sometimes causes will overlap. For example, a pump failure may be the result of poor maintenance (people) and also incorrect lubrication or the use of the wrong oil (materials). These elements may be compounded, as poor maintenance and incorrect lubrication may lead to early wear on components (machines). As potential failures are noted within each category, this crossover may help identify an unexpected root cause (e.g. the lack of proper maintenance, leading to lubrication errors, leading to early wear, leading to pump failure under extreme weather conditions). The proximate or direct cause may appear to be extreme weather, while the root cause is actually the lack of proper maintenance. Fixing the pump will solve the immediate concern, but until maintenance accuracy is improved, one would expect to see more pump failures over time.
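The pump example can be sketched as a small data structure in Python. This is an illustrative sketch only: the function names and the "caused_by" links are assumptions made for the example, and in practice these links would come from verified evidence rather than being asserted up front.

```python
CATEGORIES = ("People", "Methods", "Machines", "Materials",
              "Measurement", "Environment")


def build_fishbone(causes):
    """Group each brainstormed cause under its fishbone heading."""
    bones = {category: [] for category in CATEGORIES}
    for cause, category in causes.items():
        bones[category].append(cause)
    return bones


def trace_back(cause, caused_by):
    """Follow 'caused by' links from a proximate cause toward the root."""
    chain = [cause]
    # Stop at a cause with no known predecessor, or if a loop is detected.
    while chain[-1] in caused_by and caused_by[chain[-1]] not in chain:
        chain.append(caused_by[chain[-1]])
    return chain


# Causes from the pump example, filed under the generic headings.
causes = {
    "Poor maintenance": "People",
    "Incorrect lubrication": "Materials",
    "Early wear on components": "Machines",
    "Extreme weather": "Environment",
}

# Hypothetical crossover links between categories.
caused_by = {
    "Early wear on components": "Incorrect lubrication",
    "Incorrect lubrication": "Poor maintenance",
}

bones = build_fishbone(causes)
chain = trace_back("Early wear on components", caused_by)
```

Here the chain ends at poor maintenance, which crosses three headings (Machines, Materials, People) before reaching a candidate root cause.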
The fault tree is a method frequently used by NASA and the nuclear power industry. A fault tree is a deductive, backward-looking analysis that arrives at the failure cause at the end of the analysis. Fault tree analysis (FTA) produces a logical sequence of events necessary to produce the failure. The end result is the beginning of the fault chain. Each step in the chain is a Boolean logic AND or OR step, requiring each step to logically precede the next.
FTA is a more systematic approach than the fishbone diagram, which logically groups potential causes brainstormed by the team, but does not require the construction of the causal chain. A fishbone diagram can be a useful starting point that allows for freeform generation of plausible ideas. A fault tree can be used to then detail the causal relationships between these ideas in a logical chain. For the investigation of specific failures, FTA can provide the detail and depth required to improve design or determine mitigating strategies to prevent future failures.
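The Boolean AND/OR structure of a fault tree can be illustrated with a minimal Python sketch. The tree below is hypothetical and reuses the pump example purely for illustration. The Node class is an assumption of this sketch, not a standard FTA library.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    name: str
    gate: str = "BASIC"  # "AND", "OR", or "BASIC" for a leaf event
    children: List["Node"] = field(default_factory=list)

    def occurs(self, facts: Dict[str, bool]) -> bool:
        """Evaluate whether this event occurs given verified basic events."""
        if self.gate == "BASIC":
            return facts.get(self.name, False)
        results = [child.occurs(facts) for child in self.children]
        if self.gate == "AND":
            return all(results)
        if self.gate == "OR":
            return any(results)
        raise ValueError(f"unknown gate type: {self.gate}")


# Hypothetical tree: the pump fails only if the bearing is worn AND the
# weather is extreme; the bearing wears if EITHER lubrication fault occurred.
worn_bearing = Node("Worn bearing", "OR", [
    Node("Wrong oil used"),
    Node("Lubrication interval missed"),
])
top = Node("Pump failure", "AND", [worn_bearing, Node("Extreme weather")])

facts = {"Wrong oil used": True, "Extreme weather": True}
failure_reproduced = top.occurs(facts)

# Removing a basic event shows whether it was necessary for the failure.
without_oil_error = top.occurs({**facts, "Wrong oil used": False})
```

In this sketch, the top event occurs with the stated facts and no longer occurs once the lubrication error is removed, which marks that error as a candidate for corrective action.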
Additional methods may be useful for determining root cause, including the Six Sigma 5 Whys, the Pareto chart, the Current Reality Tree (CRT), and FMEA. Which method is used may depend on the type of failure under investigation, but additional tools can be brought in later in the process to provide more depth or understanding at any point. Combinations of these tools may be the most productive. Six Sigma’s 5 Whys method, for example, can add depth to the development of a fishbone diagram by forcing the team to provide more valuable detail for each segment. The Current Reality Tree can be useful when multiple failures are experienced simultaneously. Mapping these failures and their potential causes may lead to a single root cause for several apparent failures. A Pareto chart can be used when there are multiple probable causes contributing to a single failure, to help narrow in on the causes that most impact the failure under investigation. Although it is primarily a predictive tool, an FMEA can be valuable in narrowing down failure causes and providing the team with a head start in problem-solving.
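As an illustration of the arithmetic behind a Pareto chart, the following Python sketch ranks causes by how often they occur and accumulates their share of the total. The counts are invented for the example, and the 80 percent cutoff is a common convention assumed here rather than a requirement of the method.

```python
def pareto(counts, cutoff=0.8):
    """Rank causes by count; return all rows and the causes reaching the cutoff."""
    total = sum(counts.values())
    if total == 0:
        return [], []
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)

    rows, running = [], 0
    for cause, n in ranked:
        running += n
        rows.append((cause, n, running / total))  # cumulative share

    # Keep the top causes until the cumulative share reaches the cutoff.
    vital = []
    for cause, _, cumulative in rows:
        vital.append(cause)
        if cumulative >= cutoff:
            break
    return rows, vital


# Hypothetical tallies of contributing causes from maintenance records.
counts = {
    "Lubrication error": 12,
    "Seal degradation": 5,
    "Misalignment": 2,
    "Operator error": 1,
}

rows, vital = pareto(counts)
for cause, n, cumulative in rows:
    print(f"{cause:20s} {n:3d} {cumulative:6.0%}")
```

With these invented counts, the first two causes account for 85 percent of occurrences, so the team would likely focus there first.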
In addition, a variety of software programs are available to help guide users through the process of RCA, but these should not be considered a replacement for an experienced facilitator. RCA software can be a useful tool, but the purpose of a facilitator is to walk the team through the process, producing the desired level of detail. Like most software programs, the value depends on the quality of the information used to populate the program.
RCA, or RCCA, should ultimately result in an action or series of actions to be taken to prevent future failures. These actions may be asset or design changes, redesign of process elements, changes to maintenance strategies, documentation, operator routine duties, process control strategies, or changes relating to staffing and personnel. Often, more than one mitigating strategy will be necessary to limit the incidence or impact of future failures. Monitoring should be conducted to assess the success of the investigation in resolving the problem.
Why RCA Fails
RCA can be considered successful when it identifies elements that, when removed from the timeline, prevent the failure from occurring or significantly reduce the impact of the failure. RCA can fail when it is performed without adequate detail or sufficient support from management, or when it is cut short by time or financial constraints. One of the most common reasons that RCA fails is the rush to judgement: the RCA team relies on assumptions about the likely root cause, and seeks confirmation for their assumptions rather than allowing the facts to speak for themselves. This is one reason why it is important to have an experienced facilitator lead the RCA process. Another common issue is poor team composition, where not all relevant personnel or experts are available to provide input. If the root cause is a problem with process logic, and the electrical team or SCADA programmer is not present, the analysis may veer off course and focus on personnel, materials, or management issues that are secondary to the root cause. Lack of communication or available resources can further derail an RCA, as missing information will not be included in the analysis.
While minor failures with low impact on safety, environment or regulatory requirements may be resolved with rapid problem-solving techniques, for failures with significant impact it is worth the time and effort to conduct a careful RCA to prevent future occurrences. PinnacleART offers Root Cause Facilitation and RCCA planning as part of our suite of services. If you would like to speak to one of our Solutions Engineers about initiating an RCA or RCCA investigation and action plan, please contact us at firstname.lastname@example.org.