Author: Robert J. Latino, CEO, Reliability Center, Inc.
Author’s Note: I want to reiterate that this Series about reading the basic fracture surfaces is for novices who often come into contact with such failed components. This Series is about the basics (101), and is intended to give readers an appreciation for the value of such ‘broken’ parts to an effective investigation/RCA. While this information will be rudimentary to seasoned materials engineers/investigators, I know they will all appreciate heightening awareness of the need to retain such failed parts for formal analysis, versus throwing them away and just replacing the part. Throwing away failed parts is a recipe for a repeat failure. When we do not understand why the part failed in the first place, we can’t prevent it from failing again.
In this article we will focus on how to actually incorporate the evidence (failed parts) from a failure, into a disciplined Root Cause Analysis (RCA) process.
Summary Case Background: Let’s say we have suddenly incurred a coupling bolt failure on one of our critical pumps. As a result there is an unexpected production shutdown. We have collected the failed coupling bolt(s), along with other evidence, and are going to examine the evidence (as we have already discussed in this Series). How does this all fit into conducting a proper, disciplined and evidence-based Root Cause Analysis (RCA)?
Here is a top down and side view of one (1) of the coupling bolts.
Figure 1: Bolt front and side view
What are the FACTS when examining this bolt?
• Fasteners failed
• Washer imprint visible
• Washer imprint is not uniform around fastener
• Hinged lip can be seen
• ‘Salt & pepper’ look is present
FATIGUE: Let’s do a little refresher on our Fatigue blog about typical Fatigue characteristics:
• There is an ‘Origin’
• There are ‘Progression Marks’
• There is a ‘Final Fracture Zone’ (FFZ)
• There are ‘Ratchet Marks’
• There can be ‘Spalls’ (Hertzian Fatigue)
OVERLOAD: Likewise, let’s do a little refresher on our Overload blog about typical Overload characteristics:
• Brittle – material looks as though it can be put back together perfectly
• Ductile – material deformed
• Chevron Marks
• Salt & Pepper Look
• Hinged Lip
Which characteristics best fit the bolt shown in Figure 1?
- Brittle – material looks as though it can be put back together perfectly
- Salt & Pepper Look
- Hinged Lip
The bolt appears to meet the criteria for a brittle overload. The fracture surface color looks like salt & pepper, and it has a hinged lip.
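The comparison above can be sketched as a simple set overlap between what was observed and the two refresher lists. This is only a screening aid under my own assumptions (the function name `match_characteristics` and the lower-cased labels are illustrative, not part of any formal method); the actual call on the failure pattern still belongs to a qualified materials engineer.

```python
# Characteristic lists taken from the Fatigue and Overload refreshers above.
FATIGUE = {"origin", "progression marks", "final fracture zone",
           "ratchet marks", "spalls"}
OVERLOAD = {"brittle", "ductile", "chevron marks",
            "salt & pepper look", "hinged lip"}


def match_characteristics(observed):
    """Return, for each failure pattern, the observed characteristics it explains."""
    seen = {item.lower() for item in observed}
    return {
        "fatigue": sorted(seen & FATIGUE),
        "overload": sorted(seen & OVERLOAD),
    }


# The three characteristics selected for the bolt in Figure 1.
bolt = ["Brittle", "Salt & Pepper Look", "Hinged Lip"]
matches = match_characteristics(bolt)

for pattern, hits in matches.items():
    print(f"{pattern}: {len(hits)} match(es) {hits}")
```

All three observations land in the overload list and none in the fatigue list, which is consistent with the brittle overload reading.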
So how does this play into conducting an effective RCA? We will go through the basics of reconstructing the failure using a Logic Tree.
We first start out with the facts of the case in the form of defining the Event and the Mode(s). Every level in a logic tree represents a cause-and-effect relationship. The Event is simply the ‘reason you care’. It’s the last ‘effect’ in the chain; the point where it was determined something had to be done.
Figure 2: The Logic Tree Top Box (Event + Mode[s])
In our case, the ‘reason we care’ is that we unexpectedly interrupted production. The factual observation/anomaly at the scene was the failed coupling. At this point we can see the failed coupling, but we don’t know how or why it failed. So we ask the question ‘How could the coupling have failed?’
Figure 3: Logic Tree ‘How Can?’ Questioning
The numbers in the lower left-hand boxes represent confidence factors. The scale is 0 to 5, where a ‘0’ means that with the evidence on hand, the hypothesis is false. Conversely, where there is a ‘5’, the hypothesis is without a doubt true. For each hypothesis there is an entry in a Verification Log that shows the following:
- the validation method used
- the outcome of the validation method
- who performed the validation
- when the validation was performed
For the sake of time, I will just verbalize the verification information (just know it is all in a single spreadsheet in reality). This is a very important step so I don’t want to trivialize it. Having the completed Verification Log is what makes your Logic Tree stand up!!
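For readers who keep the Verification Log in something other than a spreadsheet, one row could look roughly like the sketch below. The class and field names are my own assumptions for illustration; the only things taken from the text are the four items listed above and the 0 to 5 confidence scale.

```python
from dataclasses import dataclass, fields
from datetime import date
from typing import List, Optional


@dataclass
class VerificationEntry:
    """One Verification Log row (hypothetical layout, not a standard format)."""
    hypothesis: str
    confidence: int                      # 0 = proven false, 5 = proven true
    method: str = ""                     # validation method used
    outcome: str = ""                    # outcome of the validation method
    verified_by: str = ""                # who performed the validation
    verified_on: Optional[date] = None   # when the validation was performed

    def __post_init__(self) -> None:
        if not 0 <= self.confidence <= 5:
            raise ValueError("confidence must be on the 0 to 5 scale")

    def missing_fields(self) -> List[str]:
        """Names of the log fields that are still blank."""
        return [f.name for f in fields(self)
                if getattr(self, f.name) in ("", None)]


# Example row for the coupling-internals hypothesis discussed in this article.
# The who and when are deliberately left blank to show the completeness check.
entry = VerificationEntry(
    hypothesis="Coupling internals failed",
    confidence=0,
    method="Inspection of coupling internals",
    outcome="No anomalies found",
)
print(entry.missing_fields())
```

A hypothesis with blank fields has a confidence number but no evidence trail behind it, which is exactly the gap the Verification Log is meant to close.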
In this case the inspection revealed the coupling internals showed no anomalies, whereas two (2) of the coupling bolts were found in pieces. Therefore we follow the path of what is true. Now we ask ‘How could the coupling bolts have failed?’
Figure 4: Logic Tree – How Can Coupling Bolt Fail?
Going back to our Series articles about this, we remember the four (4) primary hypotheses for component failures: erosion, corrosion, fatigue and overload. The bolts are inspected by qualified materials engineers/metallurgists and the conclusion is the bolts were overloaded. So the other possibilities are crossed out as ‘not true’ and we follow the evidence. Our question now becomes, ‘How could we have overloaded the bolts, resulting in the unexpected outage?’
Figure 5: Logic Tree – How Can Bolt Overload Occur?
Our potential hypotheses are either 1) bolt related and/or 2) load related. For the load related side, we check for any signs of imbalance of the impeller and/or process load abnormalities at the time of the failure. We find none were evident. Since we have two pieces of each bolt, we know there is a bolt issue, so that is a fact. We continue to follow the evidence. Next question, ‘How could we have a bolt related overload?’
Figure 6: Logic Tree – How Can It Be Bolt Related?
Our potential hypotheses are an issue with torque and/or an issue with the bolts used/materials. In the case of being torque related, we checked the torque wrench, its set value, calibration and procedures, and all appeared proper. We then focused on the bolts themselves and found that different grades of bolts were used on the same coupling. Our question (as usual) was ‘How could improper bolts have been used on the coupling?’
Figure 7: Logic Tree – How Can Improper Bolts Have Been Used?
We hypothesize the bolts could have had an issue with the materials/metallurgy, there could have been a storeroom related problem and/or something related to the mechanic and the procedures they used. In this case it was confirmed the bolt materials met spec and the storeroom had an adequate supply of the proper bolts in stock. However, the interview with the mechanic confirms there were production pressures after the last failure that put time pressure on the mechanics to fix it quickly. As a result some shortcuts were taken. So we continue to follow the evidence.
Figure 8: Logic Tree – How Can it be Mechanic Related?
How could the mechanic have contributed to the coupling bolts being overloaded? We find they installed mixed bolt grades into the same coupling. Why would they do that? This is at a decision point where we must try to understand the reasoning of the decision maker at the time. They most likely did not want the failure to occur, but wanted to be a team player to help get production up and running. So why would they install different grade bolts?
Figure 9: Logic Tree – WHY Would the Mechanic Install Mixed Grade Bolts?
Either the mechanic was following procedures or he wasn’t. We know from the physical evidence that different grades of bolts were indeed used. A review of the procedure confirms the procedure in place is adequate. So the procedure was not followed.
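The tree as reconstructed up to this point can be sketched as a small data structure, with a helper that keeps only the branches the evidence supports. This is a minimal illustration, not a substitute for a real logic tree tool: the `Node` class and `verified_paths` function are hypothetical names, and the 0 and 5 values are my reading of the narrative (ruled out versus confirmed), not the numbers shown in the figures.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    label: str
    confidence: int = 5                  # 0 = ruled out, 5 = confirmed
    children: List["Node"] = field(default_factory=list)


def verified_paths(node: Node, threshold: int = 5) -> List[List[str]]:
    """Return every top-down chain that passes only through confirmed nodes."""
    if node.confidence < threshold:
        return []
    live = [c for c in node.children if c.confidence >= threshold]
    if not live:
        return [[node.label]]
    return [[node.label] + path
            for child in live
            for path in verified_paths(child, threshold)]


tree = Node("Unexpected production interruption", 5, [
    Node("Coupling failed", 5, [
        Node("Coupling internals failed", 0),
        Node("Coupling bolts failed", 5, [
            Node("Erosion", 0), Node("Corrosion", 0), Node("Fatigue", 0),
            Node("Overload", 5, [
                Node("Load related", 0),
                Node("Bolt related", 5, [
                    Node("Torque related", 0),
                    Node("Improper bolts used", 5, [
                        Node("Materials/metallurgy", 0),
                        Node("Storeroom related", 0),
                        Node("Mechanic related", 5, [
                            Node("Mixed bolt grades installed", 5, [
                                Node("Procedure inadequate", 0),
                                Node("Procedure not followed", 5),
                            ]),
                        ]),
                    ]),
                ]),
            ]),
        ]),
    ]),
])

paths = verified_paths(tree)
print(" -> ".join(paths[0]))
```

Because hypotheses at one level are joined by ‘and/or’, the helper returns a list of chains; in this case only one chain survives, ending at the procedure not being followed.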
When we get to the human decision-making points of the reconstruction effort, our questioning switches to ‘Why?’ instead of ‘How Can?’ We are now inside the mind of the decision-maker and looking for his rationale for the decision at that point in time.
Figure 10: Logic Tree – WHY Would the Mechanic Not Follow Procedure?
At the Human Root(s), this is where we start to switch from inductive to deductive thinking. We are interested in why a good person would think they were making the right decision, at the time.
In this case, ‘Why wouldn’t the mechanic follow procedure?’ We can hypothesize that he viewed the existing procedure as inadequate for some reason (he didn’t agree with it) and/or he felt time-pressured and employed a quick workaround.
Figure 11: Logic Tree – WHY Would the Mechanic Feel Time-Pressured?
So when we get into the mind of the decision-maker, why would they feel compelled to violate procedure? Oftentimes when we hear this happen, perceived production pressures are at play. We find ourselves in a position to either ‘do it right’ or ‘do it now’.
In this situation, the mechanic had all but two (2) of the same type of bolts that met spec. He had two (2) other bolts with him which were the wrong grades and were insufficient to accommodate the normal process load…but they were close. So do I hold up production until I get the right ones from stores (which could be hours or more), or do I put a ‘band aid’ fix in place? Who can relate to being put in this situation?
Also, when this type of situation occurs, we have to ask ourselves as investigators, ‘Do I think this is the first time this behavior has been practiced?’ I find the majority of the time this is the ‘norm’. It is a practice that has evolved (deviated from the standard) over years. When we are time-pressured and take shortcuts…and nothing bad happens, that becomes the new norm (i.e., normalization of deviance). Couple that with no negative consequences and we have a new lower standard (practice) in place.
As investigators we must also look at how such a behavior is permitted to become acceptable. Where is the management oversight? Do we think supervision knows these shortcuts take place? Sure they do. But when the goal of getting production back up quickly is attained, such violations often go unchallenged (and undocumented). Only when the shortcuts fail are the decision-makers held accountable (and often disciplined…I believe unjustly).
Figure 12: Exploring the Human Element
So when we look at how a failure occurs with some type of undesirable outcome, we can see the investigation does not end with simply understanding the physics of the failure. From a human performance standpoint, the investigation starts at the decision-maker and delves deep into understanding the prevailing organizational system deficiencies, paradigms and cultural norms/influences.
In this case, how often do you think facilities would just make the repair to get production up and running, and ‘maybe’ even discipline the mechanic? Without exploring why people reason-out their decisions to uncover the latent influencers (bad systems, prevailing paradigms and cultural norms), the failure is likely to recur. This is because those deficient systems are still in play and other people will use them to make future decisions.
Simply chasing undesirable outcomes after they occur is purely reactive. Drilling an RCA down to the latent systems level makes the RCA proactive in nature, because stopping short of that level leaves the deficient systems in place and increases the risk of recurrence.
Again, this is a high level summary view of the entire process (as I see it). This is certainly a non-complex example, but the substance of the case was not the focal point, the process was the focal point. To see a more complex case study, please watch this video regarding a boiler feed pump failure.
Just as an FYI, I am pulling all of this material from our public workshops. These workshops range from the very basics to the interactive workshops for those who really want to get into this fascinating field of metallurgical analysis, RCA and Human Performance, to become Lead Investigators.
Please click the hyperlink if you’re interested in more job aids like this and/or information on associated training and tools to help with understanding Why Parts Fail.
As always, I appreciate all of the great feedback and sharing from the veterans in the field, and equally as important, the front-lines who make things happen!! Thank you for your time and participation.