Sample size considerations in medical process qualification


Involved In Discussions
The FDA guidelines for manufacturing medical devices state that the customer risk is the key quantity. Therefore, during the qualification of our process we should focus our effort on estimating/quantifying the customer risk, and on establishing evidence that we satisfy our predefined limits. The FDA also "recommends" stating a rationale for the sample size used during qualification. Nevertheless, I often read statements such as "we choose a sample size N=XXX, because this provides 95% confidence that the reliability R>=99%". The way I read this statement, the author did the following (in the example I assume a process which generates results following a binomial distribution):
1. Define a value for the probability of failure, pFail = 1-R = 0.01,
2. Define a value for the max. allowed number of failure counts, e.g. nFail = 2,
3. Define a value of the confidence C = 1 - gamma = 0.95
4. and finally calculate the sample size such that the cumulative probability equals gamma = 1 - confidence = 0.05. So the task is to find the smallest nSample such that Pr(n <= nFail | nSample, pFail) <= gamma. In this example the calculation yields nSample = 628.
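The steps above can be reproduced with a short script (a minimal sketch; the function names are mine, not from any standard):

```python
from math import comb

def binom_cdf(k: int, n: int, p: float) -> float:
    """Cumulative probability P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

def min_sample_size(p_fail: float, n_fail: int, confidence: float) -> int:
    """Smallest nSample such that Pr(X <= n_fail | nSample, p_fail) <= 1 - confidence."""
    gamma = 1.0 - confidence
    n = n_fail + 1
    # Increase n until the chance of seeing n_fail or fewer failures,
    # given the assumed failure rate, drops below gamma.
    while binom_cdf(n_fail, n, p_fail) > gamma:
        n += 1
    return n

print(min_sample_size(p_fail=0.01, n_fail=2, confidence=0.95))  # -> 628
```

With nFail = 0 the same routine reproduces the well-known "c=0" value of n = 299 for 95% confidence / 99% reliability.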
My key trouble with this method is that we are not quantifying the customer risk. I understand that this method is simple, but from a mathematical perspective we state the null hypothesis

H0: The reliability R is at least 99%

and then we collect data to show that we do not have enough evidence to reject H0. There is a famous statement in statistics: "absence of evidence is not evidence of absence". Nevertheless, this is exactly what we are implying.

My questions are
1. Am I the only one who believes that this method is wrong?
2. Is this method so common that auditors never complain about it? Has anybody had a bad experience with an auditor because of this?

I know that there exist methods which use the customer risk as an input parameter. So my question is not about other methods, but specifically about the use of the described method.

Bev D

Heretical Statistician
Super Moderator
<sigh>. So many things to address.
To directly answer your questions:
1. No, you aren’t the only one, but you are in the minority.
2. Yes, unfortunately many auditors and responsible individuals substitute popularity or frequency of use for proof of correctness.

Most of the time the accept/reject number is NOT specified ahead of the calculation of sample size except for those that subscribe to the misguided use of c=0 plans.

Risk is a vector word with two distinct and different meanings: the first is the severity of the effect (what is at risk), the second is the frequency of occurrence of the failure/effect. (Frequently misstated as the probability of occurrence; they are not the same.) In sample size calculations the risk of interest is the frequency of occurrence. (Severity is determined through logic, science and knowledge, not statistics.)

The ‘risk’ is quantified by the defect rate that is not acceptable (the RQL). Often people mistakenly use the AQL. In the confidence/reliability model, reliability is 1 - %defective.

Both the “AQL” and the confidence/reliability formulas are intended for inspection of lots from a process stream. From your post I think you are talking about validating that a process is capable of meeting a certain defect rate? That would require a completely different set of formulas.


Involved In Discussions
Thank you very much for your answer.

Although I would like to keep the focus on my initial questions, let me comment on two points from your answer.

First, you are absolutely right that, in general, "risk" has two meanings. However, I personally use the terms "occurrence" and "severity" to refer to the two input components of the pFMEA, and "consumer risk" or "customer risk" to refer to the combination of the two (the output of the pFMEA, as our quality department defines a maximum failure rate for each combination of "occurrence" and "severity"). Thus, a consumer risk could be the demand to obtain a failure rate <= 100 ppm for a specific failure mode.

Second: yes, the AQL is commonly used to inspect lots. However, since that approach uses the producer risk and the consumer risk in combination with the allowed/not-allowed failure count, it is superior to the method described in my question, as it uses the customer risk directly.
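The two-risk approach mentioned here can be sketched as a classic two-point single sampling plan search: find the smallest plan (n, c) whose operating characteristic curve stays above 1 - alpha at the AQL (producer's risk) and below beta at the RQL (consumer's risk). This is a generic illustration, not a method taken from the thread; the AQL/RQL values below are made up for the example:

```python
from math import comb

def accept_prob(n: int, c: int, p: float) -> float:
    """OC curve value: probability of acceptance P(X <= c), X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(c + 1))

def two_point_plan(aql: float, rql: float, alpha: float, beta: float,
                   n_max: int = 5000) -> tuple[int, int]:
    """Smallest single sampling plan (n, c) satisfying
    P(accept | aql) >= 1 - alpha  (limits the producer's risk) and
    P(accept | rql) <= beta       (limits the consumer's risk)."""
    for n in range(1, n_max + 1):
        for c in range(n + 1):
            if accept_prob(n, c, rql) > beta:
                break  # raising c only increases the consumer's risk further
            if accept_prob(n, c, aql) >= 1 - alpha:
                return n, c
    raise ValueError("no plan found up to n_max")

# Illustrative values: AQL = 0.1%, RQL = 1%, both risks capped at 5%.
print(two_point_plan(aql=0.001, rql=0.01, alpha=0.05, beta=0.05))  # -> (628, 2)
```

Note that the resulting plan coincides with the n = 628, c = 2 plan from the opening post: the confidence/reliability calculation pins down only the consumer-risk point of the OC curve, while the two-point design additionally controls where the producer-risk point falls.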


Trusted Information Resource
The FDA guidelines for manufacturing medical devices state that the customer risk is the key quantity.

Can I ask which FDA guidelines are being referred to here? There is a lot that could be written on a variety of topics relevant to medical devices ... just to name two different areas: User validation and effectiveness of risk controls in manufacturing.


Involved In Discussions
My intention is not to discuss one particular guideline but to address the "general principle", while still keeping the focus of the question. However, since you ask: e.g. the book "The Medical Device Validation Handbook" (editor: Max Sherman), which is an extended version of this guidance report, considers the customer risk to be the key quantity. The book presents many formulas, but it neither explains the general principle applicable to different distributions, nor states that the "simple method" presented in my original question is insufficient to quantify the customer risk. Thus, on the one hand the book states that we must use statistical arguments to back up our qualification plans/reports; on the other hand, the statistical arguments are selected for their simplicity. Does this help in addressing my original question?

PS: I noticed that I used 100ppm in my last post. Yes, this is not a standard customer risk for a binomial distribution.


Trusted Information Resource
Thus, on the one hand the book states that we must use statistical arguments to back up our qualification plans/reports; on the other hand, the statistical arguments are selected for their simplicity. Does this help in addressing my original question?
Bluntly: I don't think the original question is particularly clear. Since you linked to the GHTF guidance (NOT an FDA guidance, though it is one I have linked to myself in many posts here over the years), I will restrict my further discussion in this thread to PROCESS VALIDATION.

From that guidance, a perfectly good definition of Process validation: establishing by objective evidence that a process consistently produces a result or product meeting its predetermined requirements.

The question of statistical methods enters the discussion via objective evidence, consistently, and meeting requirements. Briefly (and generally):

Objective evidence is derived from having a pre-approved study design and using data collected in an unbiased way. Yes, there can be analysis of historical data, but there have to be safeguards in place to avoid cherry-picking data, following a "forking path", etc. It is most objective to establish the study design first and then collect the data.

Consistency requires (among other things) the establishment, and continued use, of control mechanisms. Statistical methods will factor into these control plans as well.

Meeting requirements will not always reduce to simple binomial (i.e. pass/fail) tests. Folks can certainly reframe variable measurements as attribute studies, but attribute studies are generally less informative than variable studies (a simple example: process drift of a dimension will be missed in a binomial attribute analysis up until the point at which the attribute starts failing) and often drive up sample sizes for equivalent power.

Everything I wrote above is completely independent of RISK ANALYSIS. All of the above is JUST MATH. In the regulated medical device industry, there needs to be (1) a documented method of HOW a company is reducing risks, and (2) a rationale for WHY they BELIEVE those methods are reducing risks.

Some examples of the enumerated points above:

(1) If a company recognizes that a manufacturing process (sterilization) is necessary to reduce the risks of biological contamination, that will be documented in the risk management file. This will include an assessment of "how powerful" the evidence that this process works needs to be. This is the human assessment, made based on clinical experience, business decisions, whatever. Then...

(2) The company will perform a process validation, and establish controls, commensurate with the assessment of necessary power that is motivated by the risk management file. Again: this is just mathematically driven based on whatever hypothesis testing is proposed... once it is determined how many Type I and Type II errors will be accepted (see above), everything else falls out of the study design.

With respect to the math: I would prefer that binomial study designs be constructed with the goal of having a sample capable of rejecting the null hypothesis with an established power. It is of course possible to turn the study of a binomial attribute "backwards", but this ultimately just leads to confusion.

A classic example of the sort of confusion on this point of attribute testing: inspecting a part for a specific defect. The outputs of such an inspection are casually (and typically) said to be sorted into "bad" or "good" parts. This is NOT the case: the parts have been sorted into "rejected" or "not rejected" parts. The validation of this process (here, a test method) establishes the confidence of the sorting. If someone wants to call the "not rejected" parts "good", the question of "how good" is answered by the original study design. To bring this back to the (patient, user) risk analysis: that question (i.e. "how powerful does it need to be") was answered by the humans in (1), not by the maths of (2).

Bev D

Heretical Statistician
Super Moderator
Directly and simply:
Question 1: no
Question 2: yes

anything else? Or is this a poll?


Trusted Information Resource
I'll add this bit about my experience with auditors. Like most people, they are "not good at maths"... unless they have been required (typically as part of a job) to use and maintain a set of math skills, it is highly unlikely that specific questions will crop up during an audit. If anything, there might be "red-faced" questions asked, but even in those cases I don't expect that the auditor has spotted a mistake in the math. I want to believe that professional auditors know some fundamentals of sampling plans, but from experience I never see third-party auditors even using sampling basics. It's more like "Give me all the records since ____, I'll pick a few that I want to see in detail."

One example from my own past is the time a third-party auditor wanted to know why a sample size was chosen. It fell to me to answer the auditor, because no one else could give the answer without their face turning red. The answer was: we didn't sample, we checked 100% of the lot. Before I faced the auditor I was explicitly warned by my superiors that this was a serious question about sampling!

Personally, I do believe that it would take very little to generate red faces by questioning the change in ordinal (groan) occurrence ratings as represented in risk management files that list "methods" as the risk controls responsible for the reduction in those ordinal ratings. It is left as an exercise to the reader to make a sample size estimate for a method that claims to reduce the (pre-controls) occurrence of a failure mode from a "10^5 rating" to a (post-controls) "10^1 rating".


Involved In Discussions
@Tidge: If somebody writes in the qualification plan that the goal is to obtain 95% confidence of a reliability of 99%, and in the report they show that they satisfied this goal, the formal requirements are met. As an auditor usually only sees these documents, it's understandable that the auditor has nothing to complain about. After all, one of the major tasks of an auditor is to check that the organisation follows its internal quality goals. However, for the organisation it is a different subject. Usually the organisation is interested in limiting the consumer risk to a certain level -- e.g. in my organisation we use the consumer risk to define the requirements. However, there is a common (mis)conception that the consumer risk is "well-controlled" using the two parameters (reliability, confidence). I personally have a hard time arguing why I want to use a different method to define the sample size for my OQ and PQ.

I know that there exists a one-to-one correspondence between (reliability, confidence) and the consumer risk for each assumed failure probability, because the operating characteristic curve of the binomial distribution is completely defined by two independent parameters. Therefore, if we choose one failure probability we are able to calculate the power of the OQ and PQ tests. However, without selecting a specific failure probability there exists no unique power. Commonly, this last step is omitted, and I feel that it should be mandatory, given that the customer risk plays such a prominent role in many documents.
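The point that there is no unique power without an assumed failure probability can be made concrete with the n = 628, c = 2 plan from my original question: the probability of passing the qualification (the OC curve) and hence the power both depend entirely on which true p is plugged in (a sketch; the p values below are illustrative):

```python
from math import comb

def pass_prob(n: int, c: int, p: float) -> float:
    """Probability the qualification passes: P(X <= c), X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(c + 1))

n, c = 628, 2  # the plan from the opening post
for p in (0.001, 0.005, 0.01, 0.02):
    pa = pass_prob(n, c, p)
    print(f"true p = {p:.3f}: P(pass) = {pa:.3f}, power to detect = {1 - pa:.3f}")
```

Only at the assumed RQL of p = 0.01 does the plan deliver the advertised 95% confidence (P(pass) just below 5%); at p = 0.005 the plan still passes roughly a third of the time, so the "consumer risk" is only controlled at one specific point of the OC curve.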

The point you addressed in your last post is one I often forget. Therefore, my question would be: do you believe that non-sampling errors commonly dominate the error due to omitting the last step in the procedure described above?

@Bev D: No, this is not intended to be a poll. Although I am interested in the question "how often are auditors unhappy with the described method?", I am mainly interested in the experiences of those few who had trouble with auditors.

Bev D

Heretical Statistician
Super Moderator
Semoi - you seem to be arguing yourself in circles. This seems to be based on your stubborn adherence to false and/or misleading practices that you do not like, and yet cannot break your thought process out of.

You responded to Tidge: “my question would be: Do you believe that non-sampling errors are commonly dominating the error due to omitting the last step in the procedure described above?” My answer is a very qualified ‘yes of course’, BUT the real answer is that the very methodology itself is the error. Get over it and move on. Simple adherence to the consumer’s risk will not improve the statistical assessment, nor comply with the intent of OQ, for this confidence/reliability method. You can spout all the theoretical statistical rhetoric you want and it won’t change the fact that inspection statistics are not appropriate for OQ. Are you at all interested in knowing what is appropriate?

In response to me you said “No, this is not intended to be a poll. Although I am interested in the question "how often are auditors unhappy with the described method?", I am mainly interested in the experiences of those few who had troubles with auditors”. It doesn’t matter if a few (very few, if any) people have had trouble with auditors over their statistical approaches. Auditors have a very limited job: they are there to enforce compliance to a minimum standard. They are not (in almost all cases) experts in statistics or the science involved in your organization's product. Misery may love company, but the company you seek is imaginary.

Now, reviewers with the USDA or FDA (mostly in the pharma world) are experts in (theoretical) statistics and/or the science they review, and they can and do impose specific statistical approaches that tend to avoid the inspection methodologies that are inappropriate for OQ, qualification or verification/validation (these terms are used differently in different industries). However, these individuals also tend to be stuck in outdated practices, methods and knowledge. (Salary, seniority and a govt pension are very sticky.)

You should not use any approach because of an auditor; you should use an approach that is correct for its purpose because it’s the right thing to do. Current, recent and not that long ago history is rife with catastrophic examples of what happens when an organization meets some minimum standard instead of doing the right thing.

I will state again that the confidence/reliability approach is NOT appropriate for OQ. This is made worse by the convoluted approach that you have described. Even if you used the "consumer's risk" (the RQL, or Rejectable Quality Level) for sample size determination, the formula is inspection-based and not appropriate for OQ.

If you are interested in understanding better approaches (which will help you convince your organization of a better way) then we can continue this conversation. I will advise you that you will never change anyone’s mind simply by telling them they are wrong and that many people agree with you.