Sample size requirements for process validation and design verification regarding worst-case testing

hirohama naoki

In the well-known statistics textbook by Taylor, it is stated that in worst-case testing for design verification, a sample size of n = 1 may be acceptable. However, for Operational Qualification (OQ) in process validation, the recommended sample size is n = 15. Although both are considered challenge tests, why is there a difference in sample size?
 
Thanks @Miner

I no longer have access to Taylor’s book, but the answer to the OP’s question lies in the words design verification and OQ (validation). Verification is blind to manufacturing variation and only tests that the design meets the intent, so the single unit is typically built to nominal. Validation is not blind to variation and must test the specifications under allowed variation (OQ).

While it is true that a single unit can be tested at worst-case conditions, and the results are reliable as long as the worst-case conditions include the unit itself, I think that 15 units for each OQ condition is overkill UNLESS you haven’t really determined the operating design space beforehand, in which case it is probably too small. Remember that OQ is a validation that you have determined the correct specifications through solid experimentation. See my explanation of this process in the resources section. This use of a single unit for expensive parts/testing has been enshrined in many engineering standards for decades.

I typically prefer to use 3 parts for each OQ condition and for worst-case testing. If the parts and/or testing are extremely expensive and limited, then a single unit can be sufficient provided the physics is well understood. See my brief explanation of why worst-case testing is far better than random sampling.

Hope this helps. No doubt it will generate further questions.
 
I can't speak to the advice from Taylor, but as far as design verification goes:

If the specific design verification is framed as a (binomial attribute) hypothesis test where the null hypothesis is H0: p = 0.01 and the alternative hypothesis is H1: p = 0.95, it only requires a sample size of 1 to reject the null hypothesis with confidence = 0.95 and power = 0.90.

In practice: I've never met anyone who thinks about verification this way, but this is (I believe) the mathematical justification for certain types of verification where variability could (hypothetically) come into play. Another (cruder) way of looking at this N=1 sample size is "No $#!+, either the requirement will be obviously met or not."
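
A minimal sketch of that framing (my own illustration, not from Taylor), assuming p is the probability that a single unit passes the verification test: with n = 1 and that one unit passing, the result is significant at 95% confidence and the power under H1 comes out at 0.95.

```python
from scipy.stats import binom

# Hypotheses on p = probability that a single unit passes the verification test
# (my reading of the framing above; the numbers are illustrative, not a standard).
p_null, p_alt = 0.01, 0.95
n = 1          # sample size
alpha = 0.05   # i.e. the 95% confidence level

# If the one unit passes, the p-value under H0 is P(1 pass out of 1 | p = 0.01).
p_value = binom.pmf(n, n, p_null)   # = 0.01
reject_h0 = p_value < alpha         # True -> H0 rejected at 95% confidence

# Power: probability of seeing that rejecting result when H1 is true.
power = binom.pmf(n, n, p_alt)      # = 0.95, which exceeds the stated 0.90

print(f"p-value = {p_value:.2f}, reject H0: {reject_h0}, power = {power:.2f}")
```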

I'm not going to touch the specific process validation number, except to remark that I believe N=14 is the minimum sample size at which 0 failures of some binomial-output(*1) process would satisfy a process requirement of 95% confidence with 80% power... and in long-ago practice I had a small number of encounters with non-math-minded 3rd-party auditors who didn't flinch at certain process validations once N exceeded 11. In those cases the sample size was for PQ in medical device manufacturing; the terminology of OQ/PQ (and sometimes IQ) varies between industries. I often had a somewhat wide variety of process variables, so depending on the number of process variables and the process, I might have needed more than 15 samples (for medical device OQs).
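
For what it's worth, that minimum-N figure matches the usual zero-failure ("success-run") relation: with 0 failures in n samples, demonstrating reliability R (what I call power here) at confidence C requires R^n <= 1 - C, i.e. n >= ln(1 - C) / ln(R). A small sketch (mine, not from Taylor):

```python
import math

def min_n_zero_failures(confidence: float, reliability: float) -> int:
    """Smallest n such that 0 failures in n samples demonstrates at least
    `reliability` with `confidence` (success-run relation: R**n <= 1 - C)."""
    return math.ceil(math.log(1.0 - confidence) / math.log(reliability))

print(min_n_zero_failures(0.95, 0.80))   # -> 14 (95% confidence / 80% reliability)
```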

There is no substitute for critical thinking about what one is actually trying to demonstrate.

(*1) I mostly relied on binomial process outputs, but I want to say that the sample size of 15 for 95%/80% conveniently provides a 1-sided k-value of just under 1.5 and a 2-sided k-value of just under 2, which maybe was easy for people to understand/eyeball? There is nothing magical about 95%/80% except that in some places I've seen it be a common first target when trying to improve (for the first time).
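
If those k-values are read as normal-distribution tolerance factors (my assumption about what the footnote means), they can be checked directly: the exact one-sided factor comes from a noncentral t quantile, and Howe's approximation covers the two-sided case. For n = 15 at 95% confidence / 80% coverage, the results land just under 1.5 and just under 2, matching the eyeball values above.

```python
import math
from scipy.stats import norm, nct, chi2

def k_one_sided(n: int, confidence: float, coverage: float) -> float:
    """Exact one-sided normal tolerance factor (noncentral-t formulation)."""
    delta = norm.ppf(coverage) * math.sqrt(n)
    return nct.ppf(confidence, df=n - 1, nc=delta) / math.sqrt(n)

def k_two_sided(n: int, confidence: float, coverage: float) -> float:
    """Two-sided normal tolerance factor (Howe's approximation)."""
    z = norm.ppf((1.0 + coverage) / 2.0)
    chi2_lower = chi2.ppf(1.0 - confidence, df=n - 1)
    return z * math.sqrt((n - 1) * (1.0 + 1.0 / n) / chi2_lower)

print(round(k_one_sided(15, 0.95, 0.80), 2))   # just under 1.5
print(round(k_two_sided(15, 0.95, 0.80), 2))   # just under 2
```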
 
If the specific design verification is framed as a (binomial attribute) hypothesis test where the null hypothesis is H0: p = 0.01 and the alternative hypothesis is H1: p = 0.95, it only requires a sample size of 1 to reject the null hypothesis with confidence = 0.95 and power = 0.90.
Uh I’m not sure how this works. How would one calculate the standard deviation when there is no variation to input? What am I missing?

In general most of the n=1 tests dictated by some standards are based on worst case conditions and a ‘bloody obvious’ result: either it passes or it fails. There is no statistical magic involved. Only physics.
 
If the specific design verification is framed as a (binomial attribute) hypothesis test where the null hypothesis is H0: p = 0.01 and the alternative hypothesis is H1: p = 0.95, it only requires a sample size of 1 to reject the null hypothesis with confidence = 0.95 and power = 0.90.

Uh I’m not sure how this works. How would one calculate the standard deviation when there is no variation to input? What am I missing?

In general most of the n=1 tests dictated by some standards are based on worst case conditions and a ‘bloody obvious’ result: either it passes or it fails. There is no statistical magic involved.
It is the 'bloody obvious' verification of requirements... so if a requirement is something (speculative) like "System shall have an LED to indicate it is powered", you only need to check that the LED was added to the design. If there is another mathematical explanation for why N=1 for something like a 60601-1 type test, I'd like to see it.

How this math plays out in another context is with "stopping decisions" for things like treatments for medical conditions: if the difference in outcomes is huge, it is possible to recognize that the treatment is doing what is needed without collecting all the data called for by the original study design.
 
Well, that is a complex web, but I assume (I am not going to go probing around in there for sample sizes and test protocols) that it involves safety tests on a single device. This is typical for UL testing. The test protocols are usually quite specific and very aggressive (worst case). The other item of note is that this testing is done before manufacturing, which means the test proves the device is capable of meeting the test requirements (a passing result) by design. It does not include any consideration of device variation, such as variation that might be introduced during manufacturing. Again, this is typical for all industries with UL requirements and other safety-type requirements.

Usually the other concern with this type of testing is that it is destructive and expensive, so given the physics of the test, its aggressive nature, and its destructive nature, a sample size of 1 is sufficient. No math/stats necessary; the justification is scientific and not statistical. No mathematical contortionism can justify this sample size; n=1 has no variation, so no SD can be calculated. Hence there is no ‘confidence’, no probability, no defect rate, no reliability that can be calculated. None.

And with this - like any verification ‘test’ - there is no implied or stated guarantee that a failure won’t occur in the future when expected and unexpected variation occurs, if only because there is a single unit that is usually best case, or at least at nominal, or created under engineering conditions and not manufacturing conditions. So this single-unit verification testing is ‘backwards looking’.

Verification doesn’t test at the device extremes; hence OQ, with extreme conditions and extreme-but-allowable units. That is validation. Then we have some required reliability testing that takes wear over time/use into account. Then we have manufacturing controls and tests that ensure the specifications are maintained.

‘Stopping decisions’ are a different can of worms that also requires critical thought…and deserves a different thread.
 
The original post asked a question about two different areas: design verification and process validation. The N=1 sample size (when allowed) for design verification doesn't care about "extreme conditions" because (as noted) there is no standard deviation that can be calculated to establish anything like a 1-sided limit (which presumably implies "requirement satisfied"). N=1 is accepted because there is some binomial attribute hypothesis test (that is satisfied). The binomial distribution is valid for N=1.

Process validation is about understanding the variation in the process of manufacturing instances of the (single) design (that was verified)... so without knowing what went into Taylor's (simply stated in the OP) N=15, we can only speculate(*). The same math is applicable to design verification of course, but the question was about process validation. There was this question:

Although both are considered challenge tests, why is there a difference in sample size?

With the answer here being (I believe): sample sizes are different because different hypotheses are being challenged in the two examples (some design verification test vs. some process validation test), so we probably could have left that as the answer.

(*) I'm now curious as to what went into Taylor's N=15 number. I'm guessing it is about some variable data, and I'm curious what sort of assumptions were baked into the result... or if it was an empirical observation!

EDIT: After some gurgling, I found this note (that I'll attribute to Taylor):

N=15 samples represent the fewest number of samples which should be used for a variable sampling plan for the following reason. Variable data must be tested for normality before analysis commences. 15 data points are required to perform a normality test. Therefore, at least 15 samples are required for the variable sampling plan.

The normality requirement is important (in the source document's presentation) because some of the examples relating sample sizes to OQ and PQ acceptance criteria test against Pp and Ppk (and it looks like Minitab is recommended to get the actual challenge values). It is certainly convenient if the data are normally distributed (and often a cause for concern if they are not), but the math for hypothesis testing doesn't strictly rely on a normal distribution... the source text does appear to offer the option to switch from variable to attribute studies.
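
As an illustration of that kind of variables acceptance check (not a worked example from Taylor; the measurements and specification limits below are invented, and scipy/numpy stand in for Minitab), a sketch might run a normality test on the 15 measurements and then compute Pp/Ppk against the spec limits:

```python
import numpy as np
from scipy.stats import shapiro

# Hypothetical OQ data: 15 measurements at one challenge condition, with
# made-up specification limits (not values from Taylor's text).
lsl, usl = 9.0, 11.0
data = np.array([10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 10.1, 9.7,
                 10.0, 10.2, 9.9, 10.1, 10.0, 9.8, 10.2])

# Step 1: normality check (Taylor's stated reason for needing >= 15 points);
# Shapiro-Wilk here stands in for whatever test the source intends.
stat, p_value = shapiro(data)
print(f"Shapiro-Wilk p-value = {p_value:.3f} (> 0.05 means no evidence against normality)")

# Step 2: overall performance indices against the spec limits.
mean, sd = data.mean(), data.std(ddof=1)
pp = (usl - lsl) / (6 * sd)
ppk = min(usl - mean, mean - lsl) / (3 * sd)
print(f"Pp = {pp:.2f}, Ppk = {ppk:.2f}")   # compare against whatever challenge value applies
```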

The specifics hidden elsewhere in the same source are that the target is 90%/85% for an attribute sampling plan, and it is true that N=15 is the smallest N for which 0 failures supports 90%/85%. Below N=15 we can't hit 90%/85%. What I call Power is called Reliability in the source.
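
Plugging into the same zero-failure relation as the earlier sketch: n >= ln(1 - 0.90) / ln(0.85) = ln(0.10) / ln(0.85) ≈ 14.2, so N=15 is indeed the smallest sample size at which 0 failures supports 90% confidence / 85% reliability.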

Most interesting to me: there are some assumptions made about looking for "two or more defects" that are used to justify a confidence level of only 90%; a table for 95% confidence is also provided, which results in larger sample sizes; and the sample sizes in the Taylor reference for 95% confidence agree with my calculations (only sample sizes for c=0 failures are shown in the reference).
 