Sample size requirements for process validation and design verification: worst-case testing

N=1 is accepted because there is some binomial attribute hypothesis test that is satisfied. The binomial distribution is valid for N=1.
This is simply NOT TRUE. The binomial still requires an SD calculation. Period. Go look up the formula: with n−1 in the denominator, the binomial, the Poisson, or the negative binomial simply cannot be applied.

Here’s an example: in many elementary science projects the student is directed by the teacher to grow a plant using 2 seeds. Both have water, fertilizer and sun. One gets sing-song praise and encouragement; the other seed is given no encouragement. One of the seeds grows a plant that is taller than the other. What is the probability (p value)? It’s 50%, or p = .5. That’s straight probability, not inferential statistics*. And that’s when you have 1 and 1. In the case of only a single sample you only have 1. How would YOU calculate an SD for this? Remember, every SD calculation requires you to divide by n−1, so you have to divide by 0, which is undefined.
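For reference, the sample standard deviation formula in question is s = √( Σ(xᵢ − x̄)² / (n − 1) ); with n = 1 the denominator is zero, so s is undefined for a single observation.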

*Fisher himself advocated inferential statistics (using the SD) instead of straight probability because, even though straight probability was better, it was incredibly difficult to compute before computers…

Please be more specific about what you are trying to say. The OP's question was answered: do you actually disagree or agree? Start with your position, then give your reasoning. Mostly you are giving only your reasoning, not your position.


…the rest is rationalization that just circles the drain of inferential enumerative statistics… validation sample size is a LARGE, LONG discussion about the difference between attribute and continuous data, inferential statistics, acceptance sampling vs. estimation, and enumerative (descriptive statistics about only what is in front of you) vs. analytic (predictive) statistics, and it has no place in this thread… if you wish to discuss this point further, please start a new thread…
 
This is simply NOT TRUE. The binomial still requires an SD calculation. Period. Go look up the formula: with n−1 in the denominator, the binomial, the Poisson, or the negative binomial simply cannot be applied.
You don't need to worry about standard deviations in a binomial distribution, because it is possible to brute-force the smallest N for which the cumulative binomial distribution meets some critical value. Start with N (number of trials) = 1, p0, and the Confidence; take the resulting required number of successes for that N, then use p1 and get the cumulative binomial distribution. Keep doing that as N is incremented, and once (1.0 minus) that value exceeds the Power, you have found the minimum number of successes and the minimum number of trials needed to discriminate between p0 and p1 at the requested Confidence and desired Power.

If p0 << p1, it is possible to demonstrate that a sample size of 1 (with 1 pass) is enough to satisfy some values of confidence and power. For example, 1 pass in 1 trial is enough at 95%/95% to reject p0 = 0.01 in favor of p1 = 0.99; 2 of 2 is required for 99%/95%; etc. There is no "standard deviation" calculation involved.
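Here is a minimal sketch of that brute-force search in plain Python. The function and parameter names are mine, purely illustrative; it assumes the pass criterion is "at least k successes in n trials" and a strict inequality on confidence:

```python
# Brute-force search for the smallest attribute sample size that
# discriminates p0 from p1 at a given confidence and power.
from math import comb

def tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def min_sample_size(p0, p1, confidence, power, n_max=1000):
    """Smallest n (and required successes k), where 'passing' means
    at least k successes in n trials. Returns None if n_max is hit."""
    for n in range(1, n_max + 1):
        # Smallest k whose false-pass probability under p0 is below 1 - confidence.
        k = next((k for k in range(n + 1) if tail(k, n, p0) < 1 - confidence), None)
        # Keep this n only if a process truly at p1 passes often enough (power).
        if k is not None and tail(k, n, p1) >= power:
            return n, k
    return None

print(min_sample_size(0.01, 0.99, 0.95, 0.95))  # (1, 1): one pass in one trial
print(min_sample_size(0.01, 0.99, 0.99, 0.95))  # (2, 2): two passes in two trials
```

The printed results reproduce the two cases quoted above: 1 of 1 for 95%/95%, and 2 of 2 for 99%/95%. No standard deviation appears anywhere in the calculation.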

It is true that if an analysis is relying on some mathematical property, such as scale-invariance, some math breaks down (cannot give meaningful results) with N=0 (rarely N=1), but this analysis of a binomial attribute doesn't have a "divide by zero" in it, or a normalization factor that relies on (N−1).

As I noted immediately: once people start to believe "if I see this with my own eyes, this is obviously not that," they are not thinking "I believe the alternate hypothesis is near unity and the null hypothesis is near zero, at 95% confidence with 95% power" (for whatever observation); yet I believe that is the mathematical description of what just happened with that sort of observation. For some verifications this is "good enough"… but just saying "I went deep into worst case and did a test, N=1 is enough" relies on a LOT of assumptions(*1). For example, we tolerate N=1 in a lot of software validation because of the lack of inherent variation in code… but for some performance characteristics of code we might not rely on N=1.

(*1) Taylor is appropriately conservative regarding "worst case" testing and N=1 (some emphasis mine):
Rules for deciding if testing 1 unit is sufficient. This is for requirements where there is no variation in performing the test. One example is logic testing of software where each test is performed once because the same result will occur each time the test is performed. In software validation, the focus is identifying all the logic paths and testing each one once as part of a test script. This option and approach also applies to other features such as color and presence of a feature.
and later
Worst-case testing allowing 1-5 units to be tested at each worst-case setting. Worst-case conditions are the settings for the design outputs that cause the worst-case performance of the design inputs. When worst-case conditions can be identified and units can be precisely built at or modified to these worst-case conditions, a single unit may be tested at each of the worst-case conditions. This ensures the design functions over the entire specification range. This approach is generally preferable to testing a larger number of units toward the middle of the specification range. When units cannot be precisely built at or modified to these worst-case conditions, multiple units may have to be built and tested.
I emphasized the assumptions that go into "a single unit may be tested", and even here the recommendation is "1 to 5 units" (so, not really N=1 or N=15?). Even so… here N refers to units, not to the number of tests (which can be greater than 1).

Please be more specific about what you are trying to say. The OP's question was answered: do you actually disagree or agree?

Is this a question about how you (and I) initially answered the question? If so, I think we both (initially) went into trying to explain why the sample sizes might be what they were (i.e., N=1 or N=15), but neither of us answered the actual question: "Although both are considered challenge tests, why is there a difference in sample size?"

If it is a question of "Do I agree with the Taylor analysis for sample sizes?": Yes! I agree with him, and with where the numbers apply, because he showed his math and I double-checked it myself.

I believe the answer to OP's question is: "The sample sizes are different because different hypotheses are being challenged." Do you disagree with this?
 
Yes, you can derive a sample size of 1 by ‘forcing’ the binomial (Bernoulli trials), BUT every statistician worth their degree would tell you that this approach has almost no power (despite any mathematical gyrations) and is absolutely not statistically robust or reliable. It sinks to the level of a circus trick or parlor game. And you absolutely can’t do any ‘hypothesis test math’ on the results, as there is no SD calculation.

But the real point here is that this was never the justification for a sample size of 1 in these aggressive tests or design verification tests. I was there for the development of some of them and have directly discussed many of them with the test developers and test owners (automotive, electrical and aerospace*): the use of n=1 is based on science and worst-case conditions. To come up with a ‘statistical rationale’ is revisionist history and does nothing to engender confidence (real gut confidence, not statistical ‘confidence’) in any of these tests.

I will also say that 90/95 is horribly and laughably low for any high-severity safety test such as many of those being discussed.

*Two simple examples: every jet engine manufacturer performs a bird-ingestion test to qualify their engines; they shoot a thawed (previously frozen) turkey into a running engine to determine that the engine can survive a bird strike. And every automotive assembler performs a destruct test on the welded white body (a single white body) to qualify their welding process.

As for your last sentence: I agree that part of the answer is that they are different types of challenges. I said in my very first post: “the answer to the OP’s question lies in the words design verification and OQ (validation). Verification is blind to variation of manufacturing and is only testing that the design meets the intent, so the single unit is typically built to nominal. Validation is not blind to variation and must test the specifications under allowed variation (OQ).”

We diverge when it comes to the justification or rationale for n=1 for design verification. …and the validation sample size question itself is divergent from the OP’s question and is complex enough that it should be discussed in a different thread. Feel free to start one if you want to continue that discussion.

I’m outa this one…
 
Yes, you can derive a sample size of 1 by ‘forcing’ the binomial (Bernoulli trials), BUT every statistician worth their degree would tell you that this approach has almost no power (despite any mathematical gyrations) and is absolutely not statistically robust or reliable. It sinks to the level of a circus trick or parlor game.

Hard disagree when it comes to certain verification activities, as does Wayne Taylor. N=1 can be a perfectly valid result for hypothesis tests, and not just because of "logic". If a statistician can't use math to explain why N=1 can be a valid sample size for some testing, they either have to believe that N=1 is wrong (and that 99% of medical electrical devices on the market were improperly verified) or they must recognize that there is a gap in their understanding.

In this example:
Here’s an example: in many elementary science projects the student is directed by the teacher to grow a plant using 2 seeds. Both have water, fertilizer and sun. One gets sing-song praise and encouragement; the other seed is given no encouragement. One of the seeds grows a plant that is taller than the other. What is the probability (p value)?

As stated, this isn't a hypothesis that meets the p0 << p1 condition I described.

If the p0/null hypothesis was "seeds won't sprout without encouragement" (p0, sprouting without encouragement, almost 0) and the alternative was "seeds will sprout without encouragement" (p1 >> p0)… we only have to observe 1 seed sprouting without encouragement to reject the null hypothesis. This is akin to design verification of a requirement like "was the warning placed on the label?": we do a label review, see the warning, and we are done with verification of that requirement.

If the study design is meant to verify that "with encouragement, seeds sprout into taller plants," and we settle on something like a p0 of 50% (a 50-50 chance of encouragement having an effect) and claim the effect of encouragement leads to a p1 of 55% (chance of taller plants when given encouragement), a 95%/90% hypothesis test requires a sample size of over 860 to reject p0 = 50%. Smaller effects (p1 closer to p0) require larger sample sizes.(*) This is what I *think* your example is trying to prove to me: that N=1 doesn't work for these types of requirements… but this was never in dispute. None of this is parlor tricks; it is math.
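As a back-of-envelope cross-check (my own arithmetic, not a number from the thread), the standard normal approximation for a one-sided, one-sample proportion test gives

n ≈ [ (z₀.₉₅·√(p0(1−p0)) + z₀.₉₀·√(p1(1−p1))) / (p1 − p0) ]² = [ (1.645·0.500 + 1.282·0.497) / 0.05 ]² ≈ 853,

which is in the same ballpark as the exact-binomial figure of "over 860" above.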

We diverge when it comes to the justification or rationale for n=1 for design verification. …and the validation sample size question itself is divergent from the OP’s question and is complex enough that it should be discussed in a different thread.

The OP is pretty obviously a "test question" that was seeking an answer. I'm surprised we weren't given the multiple-choice answers, so it was probably an interview essay question.

As for the specific justification for N=1, I prefer Wayne Taylor's use of the word "logic" as opposed to the word "science" to justify N=1, because science is done all the time to establish limits on measurements where N=0, even when the thing being sought does exist (the Higgs boson, gravitational waves) or has not yet been observed to exist (proton decay, magnetic monopoles). That is not verification of established design requirements, but neither are crops and crop circles.

(*) To close the loop on one other specific sample size suggestion from Wayne Taylor, and to provide context for the N=5 number of units for worst-case testing: N=5 is chosen because the underlying assumption about the DUT ("was the worst-case unit really built to be precisely worst case (enough)?") is given a prior equivalent to a coin toss (p0 = 0.5), so N=5 "guarantees" (with high confidence/power) that we have a worst-case build among the N=5. There isn't any statistical power in the N=5 with respect to the requirement itself. Of course, if the requirement is not passed it might be possible to do some analysis for attributable cause, but any of the N=5 failing means the device didn't pass the established "worst case" testing… because the 'logic' was that a DUT that passes testing at worst case meets the requirement, not that c failures out of 5 would be acceptable.
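To make the coin-toss reading concrete (my own arithmetic, under that stated prior): with a 0.5 chance per unit of a build being truly worst case, P(at least one truly worst-case unit among 5) = 1 − (1 − 0.5)⁵ = 1 − 0.03125 ≈ 97%, i.e., just clearing the familiar 95% threshold.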
 