Root Cause and People Question - Understanding CAPA's and Root Cause Analysis

  • Thread starter: Willyboy
Everyone,

Thanks! I'm reading this before I go to work today. Will try to keep everything in mind.

Jane, I've already had a discussion with the floor manager who felt that I was trying to tell her how to do her job. I'm not. I do not know how to do her job.

I did read some of Cochran's writing. Good stuff.

Even though yesterday's meeting was interesting, boggling and challenging at times, overall, I believe that there was some good, honest communication. That's a good step. Some unbiased outside instruction and learning for everyone on problem solving would be helpful.

In everyone's opinion, what is the most beneficial size of a group when discussing CA's and root causes? The floor manager and QA have a weekly afternoon meeting with about 12-14 staff in attendance. My thought was that if the groups were smaller, maybe 4-5, including the individuals most closely related to the issue/incident, then maybe a better discussion would be had. Definitely not as much side talking, distractions, etc.

Thoughts?

Once again, thanks!!!!
 
The great Deming, as well as a clear majority (~85%) of the Cove folks who voted in a poll https://elsmar.com/elsmarqualityforum/posts/339889/ that has ~200 votes over 7 years, believed that sometimes it is NOT the system that is the root cause. I'm one of those folks. Basically, I believe that roughly 1 time in 20 it might be the "latent doofus" :lol: and not the system at fault.

One should never say never. :notme:
 
In everyone's opinion, what is the most beneficial size of a group when discussing CA's and root causes? The floor manager and QA have a weekly afternoon meeting with about 12-14 staff in attendance. My thought was that if the groups were smaller, maybe 4-5, including the individuals most closely related to the issue/incident, then maybe a better discussion would be had. Definitely not as much side talking, distractions, etc.

Thoughts?
My experience has been that between 5 and 7 works best. Much fewer and you usually do not have enough input; more than 7 degenerates into side conversations, etc.
 
Jim Wynne
I recall that in a similar thread not long ago, I suggested to someone (who maintained that management is always responsible for errors made by workers) that if he had a method for hiring only people who never make mistakes, he should share it with us.
Since it is I to whom Jim is referring, I will try to clarify my position while answering your questions.

Willyboy
…, wouldn't knowing who is the recurring cause of damaged goods be important? What about the metrics of this issue?
Who is having repeated CARs for not securing their loads, Billy Bob or Suzie Que?
It can be valuable, but like all metrics, half are above average while half are below average, one is the minimum while one is the maximum. Before praise or reprimands are given, you must be sure that the data point you are looking at is outside what may be predicted. Many views (supervision, quality, HR, ...) should be represented to assure fairness. Make sure that your CAPA system maintains a record of all those who were potentially involved, for Q&A later. It will be valuable to determine whether the non-conforming operation was “Human error”, a “System error”, “bad luck”, or whether the system increased the possibility of “Human error”.
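To make the “outside what may be predicted” test concrete, here is a minimal sketch in Python. It assumes incidents arrive at a roughly constant Poisson rate, which is only one way to model this; the names and counts are hypothetical.

from math import exp, factorial

def poisson_tail(k, mu):
    # P(X >= k) for a Poisson(mu) count
    return 1.0 - sum(exp(-mu) * mu**i / factorial(i) for i in range(k))

cars = {"Billy Bob": 4, "Suzie Que": 1, "Driver C": 0, "Driver D": 1}  # hypothetical
mu = sum(cars.values()) / len(cars)   # average CARs per driver this period

for name, k in cars.items():
    p = poisson_tail(k, mu)
    verdict = "worth a closer look" if p < 0.05 else "within expected variation"
    print(f"{name}: {k} CARs, P(>= {k}) = {p:.3f} -> {verdict}")

If even the worst count is within what the overall rate predicts, the system, not the individual, is the place to look.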

But first, let’s look at the system and “Human error”. How many occurrences per year of drivers not securing their loads, causing damaged goods, do you expect and/or wish to tolerate?

How can we evaluate this? [See the attached Ver. 4, pFMEA Quick Reference Guide]
We can review the shipping or loading function of your pFMEA. You have identified a failure mode of “goods damaged in transit”. The Potential Cause is “driver fails to secure their load before leaving the dock”. What is the Potential Effect? Can the load shift during transit, causing instability and rolling the truck [S=10]? Or is it less severe? The load is damaged and has to be inspected/rejected, causing line disruption at the customer [S=8]? The load is possibly damaged and has to be inspected [S=6], or is found to be damaged and requires some rework [S=7]?

If the Severity is [10], then before proceeding the team must take action to eliminate the possibility of the load shifting or ensure that the driver cannot leave the dock before securing the load [<1 occurrence in 10,000,000 opportunities].

If the Severity is a [6], [7], or [8] there really is not much difference, because action must be taken if Occurrence is a [4 or higher] for all of those severities. Since I do not have a good history, let’s see what an occurrence of [4] means. A [4] says that we will have <1 occurrence in 10,000 opportunities. That means:

10,000 [opp/occ] / 240 [workdays/year] = 41.67 [opp-years/occ-workdays]
41.67 [opp-years/occ-workdays] / X [# of opportunities/workday] = Y year(s) per occurrence

So, if you load 3 trucks per day that the drivers need to secure before leaving the dock, then

41.67 [opp-years/occ-workdays] / 3 [# of opportunities/workday] = 13.9 year(s) per occurrence

This means that if the team believes a driver will fail to secure their load before leaving the dock more than once in the next 13.9 years, then management MUST be advised of this risk, and mitigation action(s) must continue unless management advises that it will accept this high risk.
You now have a “Human error” vs. a “System (let me down) failure” rate. You can accept “Human error” and retraining as the CA once every 13.9 years – not every time it happens!
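A minimal sketch of the arithmetic above in Python, assuming the rating-to-rate mapping from the attached guide (occurrence [4] meaning <1 occurrence per 10,000 opportunities) and 240 workdays per year:

def years_per_occurrence(opp_per_occ, loads_per_day, workdays_per_year=240):
    # opportunities-per-occurrence spread over the daily loading volume
    return opp_per_occ / workdays_per_year / loads_per_day

# Scenario 1: 3 trucks loaded per day
print(years_per_occurrence(10_000, 3))        # -> ~13.9 years per occurrence

# Scenario 2 (discussed below): 300 trucks loaded per day
yrs = years_per_occurrence(10_000, 300)       # -> ~0.139 years per occurrence
print(yrs * 240)                              # -> ~33.3 workdays per occurrence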

Note for Jim: Sometimes humans need to be protected from themselves instead of just being perfect.

Let’s look at another scenario. Your company loads 300 trucks per day that the drivers need to secure before leaving the dock, then

10,000/240/300 = 0.139 year(s) per occurrence = 33.3 workdays per occurrence

Now (for this scenario), management does not wish to tolerate this rate and asks the team to continue to take action. The team must now look at other methods to reduce the occurrence rate. So, are there ways to restage the load to make it unmistakably obvious that it is not secured?

If not, let us examine Detection methods. Currently, you have the driver sign off a check-list. Unfortunately, this process has a remote likelihood [Detection = 8] of detecting the failure mode, since it relies on the driver's visual check and memory. Remember, a human's propensity to overlook items they have already missed means they will make an “honest” mistake in checking the box.

Can we have multiple people check the load for proper securing? Multiple inspections improve the chance of detection, but it is not considered a great method. What about having the driver use a variable gage to measure the tension of the securing lines and record the values on a check-sheet [D=6]? Why not combine the two ideas and have a non-driver inspector record and sign the check-sheet? The driver can then use this as permission to leave the dock [D=4].
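A minimal sketch of how that combined release rule might look in software; the tension threshold, units, and readings are all hypothetical:

MIN_TENSION_LBF = 500   # hypothetical spec minimum per securing line

def dock_release(tensions):
    # release the truck only if every recorded line tension meets spec
    return all(t >= MIN_TENSION_LBF for t in tensions)

recorded = [612, 587, 540, 498]    # inspector's check-sheet entries (hypothetical)
print(dock_release(recorded))      # False -- the fourth line is below spec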
These check-sheets can then be audited for process control.
[For further info see] Use Benford's Law with Excel by Charley Kyd
Benford's Law addresses an amazing characteristic of data. Not only does this formula help to identify fraud, it could help you to improve your budgets and forecasts.
https://www.exceluser.com/tools/benford_xl11.htm
This method will also leave a document falsification trail if deemed necessary later.
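A minimal sketch of that audit in Python; note that Benford's Law only applies when the data spans a wide range, so narrowly clustered tension readings may not follow it. The readings below are hypothetical.

from collections import Counter
from math import log10

readings = [412, 389, 1040, 295, 187, 422, 96, 176, 2403, 515]  # hypothetical
digits = [int(str(abs(r))[0]) for r in readings if r != 0]
observed = Counter(digits)
n = len(digits)

for d in range(1, 10):
    expected = log10(1 + 1 / d)      # Benford's expected first-digit proportion
    actual = observed.get(d, 0) / n
    print(f"digit {d}: expected {expected:.3f}, observed {actual:.3f}")

Large observed-vs-expected gaps on a big sample are a cue to audit those check-sheets, not proof of falsification.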

As for documentation on the pFMEA form you can refer to your CA or PA form issue number where your more detailed thinking can be documented. The 7th step of your 8D CAPA system is then entered into the Actions Taken column of the pFMEA.

Similar methods can be devised for “boxes in the walkways”. If the boxes are empty and will be used to place product into, then why were more boxes produced than the KanBan station quantity calls for (batch production, over-production)? If the boxes are filled with product, then why is production continuing beyond the pull effect of the EOL KanBan? Is your company forcing the associate into error?

I hope this helps in your thinking and approach towards improvement.

Your management may wish to read Deming’s books. I sat through many of his weeklong training sessions and witnessed the “wrath” of Deming. He would point to the executives and ask them why they did not believe in quality (& lean) and why they blamed their workers.

Joe Adams
Process Excellence Architect
** Deleted company name **
 

Willyboy

It's important (in my opinion) for an internal auditor to always keep in mind that you're auditing (or should be auditing) against your own system. So I'd be going back to whatever your system says and using that as the benchmark. So if, for example, your system says something like 'instructions will be pulled and used' and they aren't being - then you report that as an issue. If your system doesn't say what it should (e.g., doesn't fully meet the requirements of ISO 9001) then that's a weakness, and also needs addressing, but is a slightly different issue.

Also, in audits I aim to answer the 'so what' question - i.e., if we aren't following our system, so what? What's the impact or consequence to us? If it's a risk, say so. If 'bad things' are happening because of it, state your evidence for that. Sounds like you've got that!

Then, the CAPA thing takes over - and the manager responsible has the primary responsibility of fixing the problem. (It's therefore important to issue a clear CA, and issue it to the right person!).

If you blur the lines - i.e., a) I found this problem and b) I think a good way of addressing it would be... - the a) part IS your responsibility as an auditor, but at b) you are stepping outside the boundaries, and managers can and will get peeved. Tread delicately and carefully.

If good, honest communication is occurring, that's excellent - you're moving in the right direction, one where everyone gets focussed on the problem itself (not turf wars). But turf wars may happen if lines are blurred and/or there is a lack of clear direction and support from top management.

In my view, around 6-8 people or so (max) - more than that, forget it, it isn't a discussion; it's either a lecture or a gabfest. Keep in mind, the process of learning to do this is at least as important as the problem solving; better to go slow and get better than to try to go too fast and turn people off.
 
It can be valuable, but like all metrics, half are above average while half are below average, one is the minimum while one is the maximum.
No, if half are above and half are below, it's the median, not (necessarily) the average.
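A quick illustration with Python's statistics module (the counts are hypothetical):

from statistics import mean, median

cars_per_driver = [0, 0, 1, 1, 8]    # hypothetical, skewed by one driver
print(mean(cars_per_driver))         # 2 -- four of the five drivers are "below average"
print(median(cars_per_driver))       # 1 -- this is the half-above/half-below point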

Before praise or reprimands are given, you must be sure that the data point you are looking at is outside what may be predicted. Many views (supervision, quality, HR, ...) should be represented to assure fairness. Make sure that your CAPA system maintains a record of all those who were potentially involved, for Q&A later. It will be valuable to determine whether the non-conforming operation was “Human error”, a “System error”, “bad luck”, or whether the system increased the possibility of “Human error”.
Whether or not something can be predicted is mostly irrelevant. I understand the value of understanding probability and using the understanding to support decisions, but we shouldn't assume that everything bad that might happen is or was predictable. Furthermore, even when we can predict that something untoward might happen, we might also understand that whatever measures might be necessary to prevent it aren't worth doing. If, for example, we have good reason to believe that a process will create one defective unit in every 10,000 opportunities, we may decide that what's necessary to prevent it is economically unfeasible.
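To put rough numbers on that trade-off, a minimal sketch; every figure here is hypothetical:

defect_rate = 1 / 10_000          # one defective unit per 10,000 opportunities
units_per_year = 50_000
cost_per_defect = 1_200           # scrap, rework, expediting, goodwill...
prevention_cost_per_year = 25_000

expected_loss = defect_rate * units_per_year * cost_per_defect
print(expected_loss)                              # 6000.0 per year
print(prevention_cost_per_year > expected_loss)   # True -> prevention isn't economical here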

But first, let’s look at the system and “Human error”. How many occurrences per year of drivers not securing their loads, causing damaged goods, do you expect and/or wish to tolerate?
I "wish to tolerate" none, but that doesn't mean much. If I've done all I can to prevent it from happening, and it still happens, whether I "tolerate" it or not doesn't mean anything. This idea is of a piece with those execrable things that talk about "If 99% is 'good enough' 400 babies will be dropped on their heads every year." It's not a question of whether some level of defects is good or bad; the idea is that sometimes things happen which (a) couldn't be reasonably predicted or (b) could be predicted but we can't do anything to stop them from happening, either because of the laws of physics or because to stop them from happening wouldn't be economically responsible. No one wants to see babies dropped on their heads, and we should do everything we can to prevent it from happening, but we shouldn't be surprised that when we've done everything we can, babies still get dropped. Nor should we think that because we understand this, that we're "accepting" a certain number of dropped babies.

How can we evaluate this? [See the attached Ver. 4, pFMEA Quick Reference Guide]
We can review the shipping or loading function of your pFMEA. You have identified a failure mode of “goods damaged in transit”. The Potential Cause is “driver fails to secure their load before leaving the dock”. What is the Potential Effect? Can the load shift during transit, causing instability and rolling the truck [S=10]? Or is it less severe? The load is damaged and has to be inspected/rejected, causing line disruption at the customer [S=8]? The load is possibly damaged and has to be inspected [S=6], or is found to be damaged and requires some rework [S=7]?

If the Severity is [10], then before proceeding the team must take action to eliminate the possibility of the load shifting or ensure that the driver cannot leave the dock before securing the load [<1 occurrence in 10,000,000 opportunities].
If the severity is 10, we should do everything we can to prevent the bad thing from happening, but if you think that we can always eliminate the possibility in every case, you're dreaming. Reduce the probability to the lowest responsible level. Telling people that they always must eliminate or decidedly prevent things from happening in every case is irresponsible.

If the Severity is a [6], [7], or [8] there really is not much difference, because action must be taken if Occurrence is a [4 or higher] for all of those severities. Since I do not have a good history, let’s see what an occurrence of [4] means. A [4] says that we will have <1 occurrence in 10,000 opportunities. That means:

10,000 [opp/occ] / 240 [workdays/year] = 41.67 [opp-years/occ-workdays]
41.67 [opp-years/occ-workdays] / X [# of opportunities/workday] = Y year(s) per occurrence

So, if you load 3 trucks per day that the drivers need to secure before leaving the dock, then

41.67 [opp-years/occ-workdays] / 3 [# of opportunities/workday] = 13.9 year(s) per occurrence

This means that if the team believes a driver will fail to secure their load before leaving the dock more than once in the next 13.9 years, then management MUST be advised of this risk, and mitigation action(s) must continue unless management advises that it will accept this high risk.
You now have a “Human error” vs. a “System (let me down) failure” rate. You can accept “Human error” and retraining as the CA once every 13.9 years – not every time it happens!
All of this assumes that failure rates are constant, which is a fact not in evidence. While I heartily agree that we should use the best data we have to make predictions, sometimes the best data we have isn't reliable in terms of making predictions.

Note for Jim: Sometimes humans need to be protected from themselves instead of just being perfect.
This is true, but trivially so. The idea is to reduce the effects of potential human errors, not to eliminate human error, which is impossible so long as humans can affect the outcome. Not only that, but sometimes preventive measures themselves are subject to human error or the inability to predict all possible failure modes. We might have a method of securing truckloads (or transporting babies) that works 99.9% of the time, but due to experience we believe that it will always work. Just because an event is highly improbable, though, doesn't mean that there must be 10,000 (or 1,000,000) opportunities for it to occur. It might happen this afternoon, or it might never happen. That's the crazy thing about randomness.
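The arithmetic behind that randomness is easy to sketch, assuming independent opportunities at the rated p = 1/10,000:

p = 1 / 10_000   # the "rated" failure probability per load

def prob_failure_within(k, p=p):
    # P(at least one failure in the first k independent opportunities)
    return 1 - (1 - p) ** k

print(prob_failure_within(300))      # ~0.030: about a 3% chance within one 300-load day
print(prob_failure_within(10_000))   # ~0.632: even at the "rated" count, failure isn't certain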

Let’s look at another scenario. Your company loads 300 trucks per day that the drivers need to secure before leaving the dock, then

10,000/240/300 = 0.139 year(s) per occurrence = 33.3 workdays per occurrence

Now (for this scenario), management does not wish to tolerate this rate and asks the team to continue to take action. The team must now look at other methods to reduce the occurrence rate. So, are there ways to restage the load to make it unmistakably obvious that it is not secured?

If not, let us examine Detection methods. Currently, you have the driver sign off a check-list. Unfortunately, this process has a remote likelihood [Detection = 8] of detecting the failure mode, since it relies on the driver's visual check and memory. Remember, a human's propensity to overlook items they have already missed means they will make an “honest” mistake in checking the box.

Can we have multiple people check the load for proper securing? Multiple inspections improve the chance of detection, but it is not considered a great method. What about having the driver use a variable gage to measure the tension of the securing lines and record the values on a check-sheet [D=6]? Why not combine the two ideas and have a non-driver inspector record and sign the check-sheet? The driver can then use this as permission to leave the dock [D=4].
These check-sheets can then be audited for process control.
[For further info see] Use Benford's Law with Excel by Charley Kyd
Benford's Law addresses an amazing characteristic of data. Not only does this formula help to identify fraud, it could help you to improve your budgets and forecasts.
https://www.exceluser.com/tools/benford_xl11.htm
This method will also leave a document falsification trail if deemed necessary later.

As for documentation on the pFMEA form you can refer to your CA or PA form issue number where your more detailed thinking can be documented. The 7th step of your 8D CAPA system is then entered into the Actions Taken column of the pFMEA.

Similar methods can be devised for “boxes in the walkways”. If the boxes are empty and will be used to place product into, then why were more boxes produced than the KanBan station quantity calls for (batch production, over-production)? If the boxes are filled with product, then why is production continuing beyond the pull effect of the EOL KanBan? Is your company forcing the associate into error?
Although this might be a case of an inappropriate hypothetical being used, this all seems like massive over-complication to me. A properly-secured load might get improperly unsecured after the truck leaves the dock, which is a common occurrence when using common carriers and something the sender is mostly powerless to prevent. You can have thousands of dollars worth of safety equipment, all of which is useless if a human decides to ignore or override it.

I hope this helps in your thinking and approach towards improvement.

Your management may wish to read Deming’s books. I sat through many of his weeklong training sessions and witnessed the “wrath” of Deming. He would point to the executives and ask them why they did not believe in quality (& lean) and why they blamed their workers.

Many executive horses may be led to Deming's water and still not drink, I'm afraid. This is not to say that the old master's teachings aren't important and valuable, or that we shouldn't make a reasonable attempt to enlighten people. The fact is, though, that humans are massively complex beings and no one has yet found a way to keep them from shooting themselves in the foot, and doing so repeatedly in some cases. We need to devote our best efforts to reasonably reducing the chance for human error to negatively affect outcomes, while understanding that our best efforts will never keep humans from being human.
 
Originally Posted by Jim Wynne
If, for example, we have good reason to believe that a process will create one defective unit in every 10,000 opportunities, we may decide that what's necessary to prevent it is economically unfeasible.
Absolutely true (as I pointed out): management has the right to make that decision (but not the pFMEA team – management selected the Severity, Occurrence, and Detection rates in its quality policy document that the pFMEA team was to use).

Originally Posted by Jim Wynne
I "wish to tolerate" none, but that doesn't mean much. If I've done all I can to prevent it from happening, and it still happens, whether I "tolerate" it or not doesn't mean anything.

… or because to stop them from happening wouldn't be economically responsible.

No one wants to see babies dropped on their heads, and we should do everything we can to prevent it from happening, but we shouldn't be surprised that when we've done everything we can, babies still get dropped. Nor should we think that because we understand this, that we're "accepting" a certain number of dropped babies.
A) No one wants to see babies dropped / we should do everything we can to prevent it
B) … to stop them from happening wouldn't be economically responsible

…it still happens, whether I "tolerate" it or not doesn't mean anything

It means everything. Management is willing to go to production knowing it must tolerate the risk of its processes, including the processing cost constraints, because that is better than the alternatives. But as you said, no one "wishes to tolerate" any. Therefore, management MUST start a CPI team to search for methods that are economically responsible to implement; if they do not, they are simply willing to tolerate the risk. (The decision line is ultimately drawn by loss of business or business growth, and by the absence or outcome of lawsuits.)

Originally Posted by Jim Wynne
If the severity is 10, we should do everything we can to prevent the bad thing from happening, but if you think that we can always eliminate the possibility in every case, you're dreaming. Reduce the probability to the lowest responsible level. Telling people that they always must eliminate or decidedly prevent things from happening in every case is irresponsible.
Dreaming – maybe. What risk am I willing to tolerate? See above.

Originally Posted by Jim Wynne
We might have a method of securing truckloads (or transporting babies) that works 99.9% of the time, but due to experience we believe that it will always work. Just because an event is highly improbable, though, doesn't mean that there must be 10,000 (or 1,000,000) opportunities for it to occur. It might happen this afternoon, or it might never happen. That's the crazy thing about randomness.
You are correct. But how will you and your company respond when your logistics carrier damages two of your shipments within a two-week period and sends you CAR responses stating that it was “human error” and they have retrained the driver? Would you and your company accept (tolerate) the responses, or would you be “unreasonable” and get on the phone to find a new shipping company? See above.

Originally Posted by Jim Wynne
Although this might be a case of an inappropriate hypothetical being used, this all seems like massive over-complication to me. A properly-secured load might get improperly unsecured after the truck leaves the dock, which is a common occurrence when using common carriers and something the sender is mostly powerless to prevent. You can have thousands of dollars worth of safety equipment, all of which is useless if a human decides to ignore or override it.
“… when using common carriers …” – If the load is on a common carrier, then the shipping function belongs to the carrier, and therefore, as you point out, the sender cannot control that carrier’s risk assessment. But they can review their pFMEA and Control Plans and, as we saw above, they can control which carrier they use.

“…if a human decides to ignore or override it.” – then it is not “Human error” or a tolerated risk. It is sabotage. You fire them and seek legal remedies.

Originally Posted by Jim Wynne
Many executive horses may be led to Deming's water and still not drink, I'm afraid. This is not to say that the old master's teachings aren't important and valuable, or that we shouldn't make a reasonable attempt to enlighten people. The fact is, though, that humans are massively complex beings and no one has yet found a way to keep them from shooting themselves in the foot, and doing so repeatedly in some cases. We need to devote our best efforts to reasonably reducing the chance for human error to negatively affect outcomes, while understanding that our best efforts will never keep humans from being human.
Absolutely, well stated. Humans will always continue to show us new failures and new causes. How we respond to the opportunity keeps the journey interesting.

I threw in the Deming comment to leave Willyboy with some positives instead of being totally blunt. He needs to be looking for a new job. Based upon the symptoms he listed, his company is being run by “unenlightened” management. They do not understand quality or lean principles – nor are they seeking them. As a result, their customers will not accept the high costs, poor delivery, and non-conforming parts, and will go to another supplier. Or, to lower costs while hoping for better performance from the same old processes, management will close the plant and move it to China or Vietnam. Either way, he will eventually be out of a job.

However, if Willyboy is willing to try to enlighten his management, I have seen many posts in the Elsmar Cove on some innovative techniques.

Joe Adams
Process Excellence Architect
** Company name removed **
 
Joe Adams
Process Excellence Architect
I'm not going to reply to your entire response because it's clear that we don't agree. I will make a friendly suggestion, however. You might want to reconsider your title, lest it be reduced to an unbecoming acronym. :D
 
I'm not going to reply to your entire response because it's clear that we don't agree.
I changed my mind. :D

First, you seem to think that the AIAG RPN ratings are universally recognized and used in the PFMEA process. They're not--not even in the automotive industry. They are guidelines, nothing more, and it's often the case that some other frame of reference makes more sense. If we choose a scale that ranges from 1 (severity is negligible) to 10 (all hell breaks loose) we can accomplish the task at hand in most instances without resorting to a lot of needless arithmetic.
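For reference, the arithmetic being debated amounts to no more than this; a minimal sketch, with the ratings carried over from the earlier scenario and the comparison purely illustrative, not any standard threshold:

def rpn(severity, occurrence, detection):
    # Risk Priority Number on any 1-10 scale
    return severity * occurrence * detection

# "Driver fails to secure load" under the ratings discussed earlier:
print(rpn(8, 4, 8))   # 256 -- checklist sign-off only
print(rpn(8, 4, 4))   # 128 -- tension gage plus independent inspector sign-off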

With regard to determining the economic feasibility of corrective action, you said:
Management has the right to make that decision (but not the pFMEA team – management selected the Severity, Occurrence, and Detection rates in its quality policy document that the pFMEA team was to use).
Neither you nor I may dictate what management chooses to do in these situations. Management might rationally decide to allow experienced people to do their jobs with a minimum of interference, including delegating the authority to determine the proper RPN scale to use. If a person is assigned by management to conduct a PFMEA, it should be safe to assume that the person is familiar with both the PFMEA process and the general principles of risk management.

With regard to people making mistakes, I said, "You can have thousands of dollars worth of safety equipment, all of which is useless if a human decides to ignore or override it," to which you responded:
“…if a human decides to ignore or override it.” – then it is not “Human error” or a tolerated risk. It is sabotage. You fire them and seek legal remedies.
(Emphasis in the original)

A lapse in judgment is a type of human error. "Sabotage" is rather strong, it seems, and suggests a deliberate effort to make something bad happen. The thing about human error that you don't seem to understand is that while a person's mistake (whether an error of omission or commission) might lead to something bad happening, there's a bright side, or should be. You can take one of two positions: you can believe that human error is anathema and go around firing people when it happens, or you can understand that mistakes lead to improvement. I think of something attributed to Thomas Edison in this regard: "I have not failed; I've just found 10,000 things that won't work." For an individual, knowing from experience what won't work is the very best type of training.

Of course we don't want people to learn about gravity by dropping heavy things on their coworkers' heads, but when honest mistakes--including mistakes in judgment--occur, they should be used to make things better.

Finally, in this same vein, I return to your original reference to Deming:
Your management may wish to read Deming’s books. I sat through many of his weeklong training sessions and witnessed the “wrath” of Deming. He would point to the executives and ask them why they did not believe in quality (& lean) and why they blamed their workers.
Deming also made a point (one of fourteen) of telling management to "drive out fear." Making people fearful of making mistakes or holding the belief that humans don't make them doesn't help in that regard.
 