Sample size selection for process validation - continuous data

ASchmitz93

Registered
This post is closely related to several others in this forum... however I wanted to start a fresh thread to ask a (hopefully) very targeted question. I have indeed reviewed quite a few other threads here to try and find the answer to my exact question... but haven't seen quite what I'm looking for. Please help :)

I'm validating a material manufacturing process where my material strength needs to be shown to be consistently and reliably above a required value, let's call that value X. My company has a ton of historical experience and data with this type of manufacturing process, to the point where I know/expect the process to be VERY capable. For example, historical data shows a sample average ~15% above X and no single value even within 12% of X... Ppk ~1.8 (sorry, couldn't figure out how to insert a picture of the actual historical capability analysis).

I would like to use this experience, historical data, and knowledge our company has to be more aggressive with sample sizes (lower them) going forward for process validations... but my boss is stuck on "we need to target a sample size of 59... this is based on a sample size for tolerance intervals at 95% reliability with 95% confidence... it's based on the mil-std / mil-hbk... it's the industry norm, it's what auditors expect..." - blah blah... you all seem very knowledgeable and probably know this opinion all too well.

My issue with his opinion is that when our process is so capable (tight spread and average so far above the requirement) this much destructive testing is wasteful... I could probably test 20 samples and run a capability analysis and it'd show the process is sufficiently capable.

My targeted question for all you smart people:
  • Am I on the right track in trying to justify a lower sample size by utilizing process capability? Do I just need to push harder on my boss? Or is there another statistical approach which could justify the lower sample sizes? ... I'm not in the wrong, am I?
A side question I have:
  • Which mil-std / mil-hbk is being referenced here? 105 - SAMPLING PROCEDURES AND TABLES FOR INSPECTION BY ATTRIBUTES?
  • I want to really dig into where this industry norm has come from... where do I need to look? What standard or best practice will show me why the industry has trended towards these confidence and reliability limits? (This is in pursuit of showing my boss that the original intent was different from what we're trying to achieve.)
  • @Bev D you referenced it way back here - "the tradition of using 95% confidence - or 5% alpha risk - dates back to Sir Ronald Fisher. Although this is often misquoted as Fisher suggested that a 5% alpha risk would be sufficient for an analysis IF the experiment were replicated several times with the same results being statistically significant at a 5% alpha level each time. The Mil Std acceptance tables used a confidence level of 95% for AQL based plans and since then 95% has been traditionally used for confidence levels. Reliability is not traditionally anchored... " in the 'Defining Reliability and Confidence Levels' thread
Heavily related thread(s) & comments:
  • As @Tidge so clearly defined on another thread: "Process validation: establishing by objective evidence that a process consistently produces a result or product meeting its predetermined requirements"
  • @Bev D on the 'Sample size considerations in medical process qualification' thread you said - "Both the “AQL” and the confidence/reliability formulas are intended for inspection of lots from a process stream. From your post I think you are talking about validating that a process is capable of meeting a certain defect rate? This would be a completely different set of formulas." - What are those other formulas you recommend? My boss has said we should use the sample size for tolerance interval formulas as well.

Sorry for the long post, I wanted to be thorough :)
 

Steve Prevette

Deming Disciple
Leader
Super Moderator
First the easy thing: MIL-STD-105 no longer exists (see "Whatever happened to MIL-STD-105?" on Document Center's Standards Forum, document-center.com); it was replaced by ANSI/ASQ Z1.4. In my (getting rather ancient) experience, there was no difference between 105E and Z1.4 that I could discern.

Yes, if you are doing destructive sampling, there are ways to compensate so you can reduce the sample size. Since you have past data, you could consider doing a Bayesian analysis: use your previous data for the prior (a priori) distribution, combine it with the current data to generate the posterior, and from that get an estimated probability of failing the requirement.
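Something along these lines - a minimal normal-normal conjugate sketch, treating the process SD as known from history. Every number in it is a made-up placeholder, not anyone's real data:

```python
# Minimal sketch of the Bayesian idea above: a normal-normal conjugate update
# where historical data supply the prior on the process mean and the process
# SD is treated as known from history. All numbers are made-up placeholders.
import numpy as np
from scipy import stats

X = 100.0              # required minimum strength ("X" in the original post)
sigma = 2.0            # process SD, assumed known from historical data

# Prior on the process mean, built from historical experience
mu0, tau = 115.0, 1.0  # prior mean (~15% above X) and prior SD of the mean

# New (hypothetical) validation results, e.g. 20 destructive tests
data = np.array([114.2, 115.8, 116.1, 113.9, 115.0, 114.7, 116.3, 115.5,
                 114.1, 115.9, 116.0, 114.8, 115.2, 115.6, 114.5, 115.3,
                 116.2, 114.9, 115.1, 115.7])
n, xbar = len(data), data.mean()

# Conjugate posterior for the process mean
post_var = 1.0 / (1.0 / tau**2 + n / sigma**2)
post_mean = post_var * (mu0 / tau**2 + n * xbar / sigma**2)

# Posterior predictive for a single future unit: Normal(post_mean, sigma^2 + post_var)
pred_sd = np.sqrt(sigma**2 + post_var)
p_below_X = stats.norm.cdf(X, loc=post_mean, scale=pred_sd)

print(f"Posterior mean strength: {post_mean:.2f}")
print(f"Estimated probability a future unit falls below X: {p_below_X:.2e}")
```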

You can also take advantage of continuous data (versus go/no-go), which will reduce your sample sizes.

59 is not necessarily in the MIL STD - in fact the MIL STD and Z1.4 rely on repeated batch samples to reduce the sample size. But 59 is a commonly known sample size, based on the fact that if you sample 59 things and have no failures, you are 95% confident that no more than 5% are "bad". However, you don't need to use this since you have continuous data, and could calculate the 95/5 limit with much less data. How much less depends on the standard deviation of your results.
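For what it's worth, the 59 falls out of the zero-failure (success-run) relation C = 1 - R^n; a quick check of the arithmetic:

```python
# Where "59 with zero failures" comes from: the zero-failure (success-run)
# relation C = 1 - R**n, solved for n at 95% confidence / 95% reliability.
import math

confidence, reliability = 0.95, 0.95
n = math.log(1 - confidence) / math.log(reliability)
print(math.ceil(n))              # 59
print(1 - reliability ** 59)     # ~0.9515 -> just over 95% confidence
```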

So you should look at continuous variable sampling plans, with good history justifying reduced sampling, and to top it off, run the Bayesian analysis.
 

ASchmitz93

Registered
Thank you @Steve Prevette . Following up on your suggestions, I looked into sampling plans for continuous data and arrived at ASQ/ANSI Z1.9 - "SAMPLING PROCEDURES AND TABLES FOR INSPECTION BY VARIABLES FOR PERCENT NONCONFORMING" - as a standard with guidance on this. Do you (or anyone else on this forum) know if I've arrived at the proper specification here?

It seems to me I have. It has recommended sampling plans specifically for continuous data & tables you can use when you already have a known (or anticipated) standard deviation of the population.

Does anyone else have a recommended standard? @Bev D you'd recommended "other formulas" when doing process validation rather than lot monitoring; I'd still love to know if you have resources to point me towards.

Lastly, I'm not terribly familiar with Bayesian analysis; if anyone has recommended resources regarding my question above, it'd be much appreciated.
 

Bev D

Heretical Statistician
Leader
Super Moderator
Process validation is very different from lot to lot release inspection.
ASQ/ANSI Z1.9 and Z1.4 and the confidence/reliability methods are intended for lot-to-lot release testing/inspection. (They are commonly yet mistakenly used for validation testing, but I seem to remember my Mom warning me about jumping off bridges just because all of my friends do.)

There are a couple of approaches to validation testing using continuous data that work quite well. To keep it simple, since you have a lot of historical data on this particular process (I assume you are validating an existing process for a new or revised product), I suggest using either a point estimate approach using tolerance intervals (for the individual value spread, as opposed to confidence intervals, which are for estimating the precision of the average) or simple SPC. The SPC approach will give you the smallest sample size, as you are comparing current performance against past performance, and the control limits will tell you whether the variation in the current process is consistent with the past performance. Either approach should be done on 3 lots or batches from the process, as that provides the necessary replication.
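A rough sketch of the SPC comparison, just to show the mechanics (the numbers are purely illustrative and I've simplified it to one result per lot):

```python
# Individuals-chart limits built from historical results, with the new
# validation lots checked against them. Purely illustrative numbers.
import numpy as np

historical = np.array([114.8, 115.6, 114.2, 116.1, 115.0, 114.5, 115.9,
                       115.3, 114.9, 115.7, 116.0, 114.6, 115.2, 115.4])
new_lots = np.array([115.1, 114.7, 115.8])   # e.g. one result per validation lot

mr_bar = np.mean(np.abs(np.diff(historical)))   # average moving range
center = historical.mean()
ucl = center + 2.66 * mr_bar                    # 2.66 = 3 / d2, with d2 = 1.128
lcl = center - 2.66 * mr_bar

within = (new_lots > lcl) & (new_lots < ucl)
print(f"Historical limits: {lcl:.2f} to {ucl:.2f}")
print("All new lots within historical limits:", bool(within.all()))
```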

The formula for the point estimate of a mean is easy to find on the internet or in any statistics book. Or I'm sure someone from here will post it.
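But here is one common version anyway: a one-sided lower tolerance bound, x-bar minus k times s, with the exact k factor from a noncentral t distribution. Treat it as a sketch with made-up data, not a recipe:

```python
# One-sided lower tolerance bound xbar - k*s for 95% confidence / 95% coverage,
# assuming normally distributed data. Sample values are made up.
import numpy as np
from scipy import stats

def lower_tolerance_bound(x, coverage=0.95, confidence=0.95):
    """Bound that, with the stated confidence, is exceeded by at least
    `coverage` of the (assumed normal) population."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    zp = stats.norm.ppf(coverage)
    # exact one-sided k factor: noncentral-t quantile scaled by sqrt(n)
    k = stats.nct.ppf(confidence, df=n - 1, nc=zp * np.sqrt(n)) / np.sqrt(n)
    return x.mean() - k * x.std(ddof=1)

rng = np.random.default_rng(1)
sample = rng.normal(loc=115.0, scale=2.0, size=20)   # 20 illustrative results
print(f"95/95 lower tolerance bound: {lower_tolerance_bound(sample):.2f}")
```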

I would never recommend or approve using process capability indices, as they are just statistical alchemy. However, if you use one of the approaches described and you pass, the process capability index will be OK.
 

Semoi

Involved In Discussions
As was stated in the question, it is common to use a sample size of 59 and zero-defect testing to obtain a reliability/confidence statement of 95%/95%. I feel it would be helpful to argue why this is bad practice. The same is true for the use of process capability indices.
 

Bev D

Heretical Statistician
Leader
Super Moderator
I'm confused. What in our responses (in this and your other thread) has not stated our concerns regarding the inappropriateness of these approaches? Is it not thorough enough or strident enough? Please help us understand your position and what you are actually asking for...

I'll continue by asking you - and others interested in this discussion - to read my article on Statistical Alchemy in the Resources section. There is a section on process capability indices that is fully researched and referenced. I won't retype it here.

As for the use of any confidence/reliability formula, I will add (somewhat redundantly): these are intended only to tell you that the specific lot under inspection is probably no worse than the stated reliability level. These formulas don't tell you whether the process average is stable, or far from the target, or really close to the target... which is fine for lot-to-lot inspection, but that isn't what validation is about. Validation is about estimating the actual defect or process average/SD over the long run which requires the estimation formulas for an average, standard deviation or proportion. These result in larger sample sizes and better understanding of the actual process capability.
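To give a feel for what I mean by an estimation formula (a sketch with illustrative numbers, not a prescription), here is the classic sample size for estimating a process mean to within a margin of error E:

```python
# Sample size needed to estimate the process mean to within +/- E at the
# stated confidence, given a planning value for the process SD.
import math
from scipy import stats

def n_for_mean(sigma, E, confidence=0.95):
    z = stats.norm.ppf(1 - (1 - confidence) / 2)
    return math.ceil((z * sigma / E) ** 2)

# Illustrative: historical SD of 2 strength units, mean wanted to within +/- 1
print(n_for_mean(sigma=2.0, E=1.0))   # 16
```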

Of greater concern, however, is not the formulas but the sampling frame. If you are not sampling from a set of process conditions that spans the range of allowable variation, you are only fooling yourself. No statistical formula can tell you anything about something you didn't sample...

Now, there is an acceptable basis for using lot inspection formulas for PQ IF (and that's a big if) you have performed a really thorough OQ, running the process at its min/max levels of process inputs.
 

Tidge

Trusted Information Resource
Validation is about estimating the actual defect or process average/SD over the long run which requires the estimation formulas for an average, standard deviation or proportion. These result in larger sample sizes and better understanding of the actual process capability.
I have worked with MANY people over my (now long) career who simply did not understand this. I could write more, but my frustration prevents me from making a more coherent argument than was made by @Bev D .
 

Semoi

Involved In Discussions
@Bev D: Thank you for linking the document. I was not aware that you wrote and shared it, and I believe that linking it again is beneficial to many. I read your document and believe that many of your statements are true. However, some of my conclusions differ from yours. I believe it is best if I leave comments on the page you linked. However, as my intention in this thread is to learn, I really do not want to get on your nerves.

As the confidence/reliability formula is not discussed in your document, I will put my comment in this post. I'll pick up on a statistical point in your statement, because that is where I feel confident I can contribute something.
It was Wilks who started the "confidence/reliability" discussion with his paper "Determination of sample sizes for setting tolerance limits". He presented two formulas: (1) a non-parametric formula in which he uses the CDF of the order statistics, and (2) a formula for normally distributed data. In his paper I am unable to see any reference to a finite lot size. The way I understand the paper, he discusses a population (of continuous random variables) and answers the question of how to calculate tolerance intervals for this population. Therefore, if we use a sample size of 59, we can be 95% confident that 95% of the population exceeds the minimal value of the sample, Y_(1). Hence, if the minimal value Y_(1) exceeds the lower specification limit LSL, we can be 95% confident that at least 95% of the population exceeds LSL. Of course it is possible to consider lots of finite size, but my point is: this particular formula is not tied to a specific lot size. Therefore, mathematically it is correct to apply it to the population.
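A quick numerical check of that nonparametric statement (my own arithmetic, not taken from Wilks' paper):

```python
# Confidence that at least a proportion p of the population exceeds the
# sample minimum of n observations is 1 - p**n (distribution-free).
p = 0.95
print(1 - p ** 59)   # ~0.9515 -> 95/95 is just met at n = 59
print(1 - p ** 58)   # ~0.9490 -> just misses at n = 58
```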

With all your other statements I agree: The formula does not make any statement about
(1) the average value,
(2) the stability of the process,
(3) the distance to the specification limit, or
(4) the standard deviation.
But I am not sure if this is necessary, because the OQ/PQ is not meant to replace/contribute to the development phase. During the development phase the organisation must understand and optimise the process, refine the external specification limits (if needed), establish the internal specifications, and define the action plan. The OQ/PQ is merely a verification of our knowledge -- it is not meant to add something new to our knowledge.

Just the other day we invited a highly experienced consultant to discuss/present strategies for OQ/PQ. He presented a process in which he identified 4 critical factors. Then he asked us to generate an OQ sampling plan. Most of us came up with a full-factorial design (2^4 = 16) using several replicates. He explained that this is the wrong strategy, and that we ought to perform these experiments in the development phase. He added that the OQ/PQ runs should consist of as many samples as necessary to satisfy the requirements -- we need to know in advance what our process is capable of, and what our worst-case scenarios are. Only after we have gained this knowledge can we be confident that the power of the hypothesis tests used during OQ/PQ is approximately 90%. If we reach this power using 59 samples -- great!
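To illustrate that power check (my own sketch with made-up numbers, not the consultant's calculation): the power of a one-sided, one-sample t-test that the mean strength exceeds the requirement, as a function of sample size, when development work suggests the mean sits some number of standard deviations above the requirement:

```python
# Power of a one-sided, one-sample t-test (H1: mean > requirement), computed
# from the noncentral t distribution. The effect size is in SD units and is a
# made-up planning value, not real process knowledge.
import numpy as np
from scipy import stats

def power_one_sided_t(n, delta_in_sd, alpha=0.05):
    t_crit = stats.t.ppf(1 - alpha, df=n - 1)
    return 1 - stats.nct.cdf(t_crit, df=n - 1, nc=delta_in_sd * np.sqrt(n))

for n in (8, 11, 15, 20):
    print(n, round(power_one_sided_t(n, delta_in_sd=1.0), 3))
```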

Maybe one more thing: To my understanding we have to simulate the worst-case scenario during OQ. Are there other options? How does the OQ differ from the PQ, if we do not use worst-case scenarios? The GHTF report reads: "In this phase the process parameters should be challenged to assure that they will result in a product that meets all defined requirements under all anticipated conditions of manufacturing, i.e., worst case testing."
 

Tidge

Trusted Information Resource
Maybe one more thing: To my understanding we have to simulate the worst-case scenario during OQ. Are there other options? How does the OQ differ from the PQ, if we do not use worst-case scenarios? The GHTF report reads: "In this phase the process parameters should be challenged to assure that they will result in a product that meets all defined requirements under all anticipated conditions of manufacturing, i.e., worst case testing."

The OQ is performed at the allowable extremes(*1) of the process. There will be variability in process outputs due to running at the allowable extremes; how much variability comes from the process itself is what is studied in the OQ.

Typically there is a nominal setting for the process parameters(*2). The PQ studies the variability of the process due to the inherent variability of the process as designed. This variability is outside of the control(*3) of the design team; it comes from variability in raw materials, time of day, the people executing the process, etc.

(*1) The allowable extremes were already decided upon in the "development phase", before executing the OQ
(*2) We don't allow/encourage the process to be run at the extremes, even if it lets the operators get to lunch sooner. Those extremes are supposed to be guard-bands, not opportunities for personal preference.
(*3) That is, there are no "process parameters" to control or test in a PQ... new material will show up on the dock, new people will execute the process.

It is entirely possible that process outputs have very little variability, even with relatively wide ranges in process limits. In practice, this variability in outputs usually isn't known. Look at pages 15+ of the GHTF document that was linked. If process outputs drift over a shift, this is almost certainly NOT going to be caught during an OQ.

Re: "worst case": "Worst case" refers to the outputs of the process, not the allowable extremes of process parameters. In my experience: the most common way "worst case" comes into play during the OQ planning is when there are multiple process parameters and it is well-established (usually through engineering fundamentals) that certain combination of process extremes (correlated or anti-correlated) will be the most likely to generate non-conforming outputs. This can allow a team to argue that they can reduce the number of OQ challenges. Off the top-of-my-head: If there was a pouch-sealing process where the only two process parameters (time, temperature), I think it could be argued that instead of 4 challenges (short time, low temp), (long time, low temp), (short time, high temp), (long time, high temp) the "worst cases" could presumably come from (short time, low temp) and (long time, high temp). Note that in this gedanken experiment I've already somehow decided (engineering knowledge?) that pressure of this sealer is not variable (or not variable enough) to merit being challenged.
 

Semoi

Involved In Discussions
@Bev D: I just realised that in your initial post you refer to "ASQ/ANSI Z1.9 and 1.4". These plans use a finite lot size, while Wilks' paper does not. My apologies.
 