Author: Bob Latino CEO at Reliability Center, Inc.
ORIGINAL POST 5.15.17 (ID# Shaft.5.15.17)
This is a failed shaft that came out of a pump in a paper mill. The pump was only in service for about a month before it failed unexpectedly.
From the top view above, identify the type of failure pattern that you see from the fractured surface(s). If you need more info to make your assessment, just ask.
Below is a side view of the same failed shaft.
I am seeking a discussion on the physics of the failure based on the fractured surface.
FOLLOW UP POST 5.18.17 (Provided by Ron Hughes, Sr. Investigator, RCI)
Some facts about this failure involving the shaft shown above.
1. The keyway was welded. In the past, the pump had broken numerous keys, causing excessive downtime.
2. Maintenance installed a new key made of a harder material than specified.
3. Heat from the welding process changed the microstructure of the shaft, and the weld metal added weight, which caused an unbalanced condition.
Facts about the pictures below.
1. The failure started at the lower corners of the keyway.
2. There are 2 small fatigue planes at the initiation points.
3. The initial cracks were caused by Stress Corrosion Cracking, driven by residual stresses internally induced during the welding process.
4. Not counting the case hardening of the shaft, there are 3 distinct grain structures in the shaft. These again were caused by the welding process transforming the shaft material through varying stages from austenite (face-centered cubic) to martensite (body-centered tetragonal); i.e., when you heat the material and it cools rapidly, it gets harder with the formation of martensite.
5. There are chevron marks in the case-hardened surface depth area of the shaft. This is not unusual, as these marks are left when very hard material breaks instantaneously.
6. The final fracture zone is very large indicating that the shaft was heavily loaded at the time of failure.
7. There is some torsion in the final fracture that is due to the unbalanced shaft. However, since the torsion is less than 45 degrees, this is not a pure torsional failure but rather rotating bending of the shaft.
So the physical cause of the failure is fatigue due to rotating bending and Stress Corrosion Cracking.
If we drilled deeper into the human and systemic issues by asking “Why would someone decide to make a weld repair to the keyway?”, what potential answers can you think of to that question?
This has been a great exchange of experience by some very knowledgeable experts. Thank you.
If you’re interested, we have plenty of these types of fracture patterns from cases to discuss. We apparently can do so on this type of forum; just let us know if you would be interested in sharing your expertise. Take care folks.
FOLLOW UP POST 5.22.17 – Understanding the Human Contribution to the Physics of Failure
We clearly have a great deal of technical talent that responded to this post regarding the physics of failure (the hard side of failure). But now I wanted to dive into the human contributions to the failure (the soft side of failure).
This is often the difference between what people call ‘Root Cause Failure Analysis’ (RCFA) and ‘Root Cause Analysis’ (RCA). The term RCFA tends to limit itself to the hard side of failure and RCA is a broader term intended to pull in the Human and Systemic sides of the failure mechanisms.
When dealing with the hard side of failure (RCFA) we hypothesize by continually asking ‘How Can’ the previous hypothesis have occurred. We let the evidence answer the questions for us, as it will tell us which hypotheses were true and which were not.
For example’s sake, if we hypothesize as to ‘How can a shaft fail?’, we may come up with the possibilities of Overload, Fatigue, Erosion and Corrosion. Each level of the tree, from one to the next, represents a cause-and-effect relationship in time.
I know we can think of many reasons a shaft can fail, but if we can visualize being the shaft at the time of the failure, we have to ask ‘what just happened to me’? This is where the physics of the failure is so important. The fractured surfaces tell the real story and it takes an educated eye to understand what those fracture patterns are telling us. We are simply going backwards and doing a visual reconstruction of the sequence of events.
The above is for example’s sake only, but you get the message. From here we would ask ‘How could we have a fatigue failure of the shaft?’. The questioning goes on and on, deeper and deeper, as the evidence itself leads the way. Whatever is true, we follow. What is not, we cross off as NOT TRUE.
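The ‘How Can’ drill-down described above can be sketched as a simple logic tree, where each hypothesis is verified or crossed off as the evidence answers the questions. This is a hypothetical illustration only; the class, method, and hypothesis names are mine, not part of any RCI/PROACT tool:

```python
# Hypothetical sketch of the deductive "How Can" logic tree described above.
# Names are illustrative assumptions, not part of any RCI/PROACT product.

class Hypothesis:
    def __init__(self, description):
        self.description = description
        self.verdict = "UNVERIFIED"   # becomes "TRUE" or "NOT TRUE" as evidence comes in
        self.children = []            # deeper "How can ...?" hypotheses

    def add(self, description):
        child = Hypothesis(description)
        self.children.append(child)
        return child

    def verify(self, is_true):
        self.verdict = "TRUE" if is_true else "NOT TRUE"

    def walk(self, depth=0):
        """Print the tree, showing which branches the evidence supported."""
        print("  " * depth + f"{self.description} [{self.verdict}]")
        if self.verdict != "NOT TRUE":   # disproved branches are not drilled further
            for child in self.children:
                child.walk(depth + 1)

# Build the example tree from the text: "How can a shaft fail?"
root = Hypothesis("Shaft failed")
overload = root.add("Overload")
fatigue = root.add("Fatigue")
erosion = root.add("Erosion")
corrosion = root.add("Corrosion")

# Let the evidence (the fracture surfaces) answer the questions.
overload.verify(False)
erosion.verify(False)
corrosion.verify(False)
fatigue.verify(True)
fatigue.add("Rotating bending initiated at the keyway corners").verify(True)

root.walk()
```

The point of the structure is exactly what the text says: whatever the evidence confirms, we follow deeper; whatever it disproves, we cross off and stop pursuing.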
As we drill down, eventually we will come across a human error (or several), which is simply a decision error. It will be either an error of omission or commission. We did something we shouldn’t have, or we should have done something and we didn’t.
In our hourglass slide above, we are discussing the human behavior related to our undesirable outcome. It is at this point we switch our deductive questioning (general to specific) to inductive questioning (specific to general).
We are now in the decision maker’s head, and have to try to understand his/her reasoning at the time and location of the decision. It is not for us to make judgments; we just have to put the decision into the proper context of what was going on at the time. Most of the time, when we truly understand the conditions, the decision seems perfectly logical. After all, most people don’t wake up in the morning and think to themselves, ‘How can I screw up at the plant today?’:-)
Getting back to our shaft failure and where I was guiding the discussion, what do you think was going through the mind of the maintenance personnel who welded the keyway?
In this case, let’s presume one of our Human Roots (HR) was the ‘Decision to Make Weld Repair on the Keyway of the Pump Shaft’. Our questioning reverts here from ‘How Could’ to ‘Why’. Why would the maintenance personnel have chosen to make such a weld repair on the keyway?
Some possibilities could be:
- There were no engineering Management of Change (MOC) requirements for the weld repair. In other words, there were not any guidelines for them to follow, so it was left up to their discretion. They did not violate any ‘rule’.
- There was a belief (paradigm) that a harder keyway will prevent the key from breaking. In their minds, this will also ensure increased uptime in the near term.
I am throwing this out for debate as there are additional human and systems considerations in these types of cases. Can you think of more?
Some food for thought, do you think this was the first time a keyway was welded to make such a repair? Could this have become a ‘practice’ that was acceptable, only until there was a high visibility failure? Could it be perceived that due to production pressures, such decisions are made hastily, despite the known failure risks? Do you think management was aware of these practices in the past?
What is your experience? What do you think could be going through the minds of those that were well-intentioned in this case, but their decision didn’t pan out as intended?
FOLLOW UP POST 6.8.17 (Provided by Bob Latino, RCI)
Thanks to Tim Lim for his REPLY on 6.7.17. As a result of his suggestions about the potential human contributions to this failure, I updated my logic tree in this case.
A ‘Human Root (HR)’ in this case was the actual decision to make the weld repair in the manner they did. So at this point we have to put ourselves in the position of the person making that decision and think, ‘What was going through their mind?’ WHY did they feel the manner in which they made the modification was OK?
- No engineering Management of Change (MOC) was done for this specific weld repair
- A belief by the person making the modification that a harder keyway will prevent the key from breaking (and increase uptime)
- An adequate weld procedure did not exist.
As we drill down, we continue to ask ‘Why did the person making the modification believe that a harder keyway would work?’ Perhaps the person making this modification was not a qualified welder. As Tim Lim stated in his reply, “As a rule of thumb, any steel with a carbon content of more than 0.35% by heat analysis should not be welded. Key steel would have a typical value of 0.4C.” A qualified welder would have known this and if the conditions were not appropriate, they should have questioned ‘why not’ and ‘what is Plan B?’.
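Tim Lim’s rule of thumb lends itself to a quick sanity check. The sketch below applies the 0.35% carbon threshold from his quote, and also shows the widely used IIW carbon-equivalent formula for broader context; the function names are illustrative assumptions, and threshold values vary by welding code, so this is not RCI guidance:

```python
# Hypothetical sketch of the weldability rule of thumb quoted above:
# steels above 0.35% carbon (by heat analysis) are generally poor
# candidates for welding without special precautions. The IIW
# carbon-equivalent formula is included for context only.

def weldable_by_carbon(carbon_pct, limit=0.35):
    """Apply the simple carbon-content rule of thumb from the discussion."""
    return carbon_pct <= limit

def carbon_equivalent_iiw(c, mn=0.0, cr=0.0, mo=0.0, v=0.0, ni=0.0, cu=0.0):
    """IIW carbon equivalent: CE = C + Mn/6 + (Cr+Mo+V)/5 + (Ni+Cu)/15."""
    return c + mn / 6 + (cr + mo + v) / 5 + (ni + cu) / 15

# Typical key steel, per the quote: ~0.40% C
key_steel_c = 0.40
print(weldable_by_carbon(key_steel_c))   # False: fails the 0.35% rule of thumb
```

A qualified welder would run this check mentally before striking an arc; the value of formalizing it is that it forces the ‘Plan B’ conversation before the repair, not after the failure.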
This leads us to understanding how we could have had a person who was not qualified, making such a modification. This is a managerial function. Supervisory personnel should be responsible and accountable for staffing all positions under their control with qualified personnel. Both knowledge and skill should have to be demonstrated in such positions prior to taking on responsibility for the position.
Moving on, ‘Why did a proper weld procedure not exist, since the failure had happened before?’ Previous RCAs were either inadequate or non-existent. Had a proper RCA been conducted after previous failures, it would have identified this deficiency and corrected it.
If previous RCAs were conducted, ‘Why were they not effective?’ Either the ones submitted were not scrutinized (properly validated) by management, and/or the people conducting them were not qualified to be leading such analyses. Such system flaws are referred to as Latent Root Causes, or LRs (as labeled on the logic tree).
One fact that often goes unnoticed here is, ‘Do we think this is the first time the person making the modification did it this way?’ That is something we should always ask ourselves when looking at a person’s reasoning. This person had likely done this very same thing in the past, under production pressure to get back online quickly. They likely got pats on the back and the proverbial ‘atta boy’ for getting production back up quickly. So naturally, if I get positive recognition for such behavior, I am likely to repeat it. Food for thought:-)
Thanks Tim Lim for helping me drive this point home about Human and Latent Root Causes in this case. The concept applies to all cases, so think beyond the physics of the failure and consider what role the human played in permitting the physical failure to occur.
Below are some additional resources:
PROACT Lead Investigator Training – https://www.reliability.com/lead-investigator.html (2 weeks)
PROACT RCA Methodology Training – https://www.reliability.com/root-cause-analysis-training.html (3 Days)
PROACT Investigation Management System, Software Solutions – https://www.reliability.com/software.html (Desktop, Enterprise and Online)