Sample size for design verification of a variable in a single-use device

#11
Hmm, I may need to elaborate a little. Typically when doing risk management, an unacceptable risk (based on severity and probability of occurrence) may be identified. A risk mitigator is used to lower the risk to an acceptable level, and the mitigator needs to be verified so there is objective evidence it functions as intended. To define a sample size for testing that mitigator, it makes sense to me to use the pre-mitigation risk level, since that effectively tells you how important the risk you are reducing is. Using the post-mitigation risk level doesn't make sense to me. Thoughts?
I am in agreement with everything Bev explained. I may have worded things confusingly, but I believe we're both saying the same thing. For new products, first test to a pre-mitigation severity and occurrence. Once you show you can meet your product requirements using this sample size, implement your risk mitigators and lower the occurrence, which could lower the risk level. Any issues that come up in the future that require re-evaluation of the product can use the new risk level.
 
#12
Hi, I only have access to the draft version of ISO 18562. In a couple of places it says: "To produce a meaningful result, more than one medical device may be required to be placed in series or measured sequentially." So I would use this to allow testing more than one unit at a time.

It is common to divide risk into high, medium and low based on patient risk: high being life-threatening or death, medium being risk of permanent injury, etc., then allotting confidence and reliability to each category. Using a binomial test, high risk would be 95%/99%, giving 299 samples; medium 95%/95%, requiring 59 samples; and low 95%/90%, requiring 29 samples. From your risk files you should be able to estimate the risk associated with particulate inhalation and hence establish a sample size.

I have seen this used by many medical device companies and generally accepted by FDA provided the rationale is clearly documented.
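For anyone who wants to check those tiers, the zero-failure (success-run) sample sizes quoted above follow from solving R^n ≤ 1 − C for n. A minimal Python sketch (function name is mine, not from any standard):

```python
import math

def success_run_n(confidence: float, reliability: float) -> int:
    """Zero-failure (success-run) sample size: smallest n such that
    passing n of n units demonstrates `reliability` at `confidence`.
    Derived from R**n <= 1 - C  =>  n >= ln(1 - C) / ln(R)."""
    return math.ceil(math.log(1 - confidence) / math.log(reliability))

# The commonly quoted risk tiers:
print(success_run_n(0.95, 0.99))  # high risk   -> 299
print(success_run_n(0.95, 0.95))  # medium risk -> 59
print(success_run_n(0.95, 0.90))  # low risk    -> 29
```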
 

david316

Involved In Discussions
#13
(Quoting post #12 above.)
Consider this example:

I have a medical device that runs off mains power and may be exposed to water, creating an electrical risk. In my risk assessment I have made the following determinations:

Risk (electrocution) related to water ingress - high
Risk mitigator - must meet water ingress protection to IP22.
Risk with mitigator in place - low

Note, the risk with the mitigator in place is low because the risk management team agrees that IP22 will reduce the occurrence of mains electricity being exposed to the user.

Prior to product release I need to verify that I meet IP22 water protection. So if I am following "high risk would be 95%/99% giving 299 samples, medium 95%/95% requiring 59 samples and low 95%/90% requiring 29 samples", which risk (high or low) do I use to justify my sample size? I would assume I use the pre-mitigation risk (high), because that effectively tells you how important the risk mitigator is. Using the post-mitigation risk doesn't make much sense to me.

Hopefully I am not sounding like a broken record ....

Thanks for any input.
 

Jean_B

Involved In Discussions
#15
A sidenote I've always wondered about:
The sample sizes given are derived from the success run theorem (for attributes), with parameters r% for 'reliability' and c% for confidence interval (sometimes called confidence level in this context), and the assumption of a sufficiently large sample.
This is then often stated to give you c% confidence that the mean of product reliability is above r% (for that batch).
Taking into account that we often treat 95% confidence as an 'industry expectation', we vary product reliability r% based on the 'risk' (sometimes to astonishingly abysmal degrees).

Yet I've never seen a substantiation of the power of the testing of the sample that underlies the industry-wide 'statistically valid rationale'. The assumption of perfect test information on the sample is often neglected. (As I recall, only Bev once stated a link to MSA and her painstaking care to preload the sample with known representative quantities and defect-type ratios before such a sampling exercise.)

What gives with this blind eye the regulators often turn to the underlying testing matters of power, sensitivity, specificity, etc.? Sometimes I feel our (medtech) industry's safety is built more on professional competence in design principles, industry habit and the law of large numbers than on any added value from a 'statistically valid rationale' providing a substantiated precise number. We often don't even link back to our assessed reliability to see whether we were on the mark (though I expect the industry to evolve on this, perhaps with renewed focus once ISO 14971 is updated).
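To put a rough number on the "perfect test information" point: under a simple illustrative model (my assumption, not from any standard) where a defective unit escapes detection with probability 1 − sensitivity, an n-of-n pass demonstrates less reliability than the nominal success-run claim. A Python sketch:

```python
def demonstrated_reliability(n: int, confidence: float, sensitivity: float,
                             specificity: float = 1.0) -> float:
    """Reliability actually demonstrated by n/n passes when the test itself
    is imperfect. A unit 'passes' with probability
        p = R*specificity + (1 - R)*(1 - sensitivity),
    so n/n passes only bounds p, not R. Solve p**n = 1 - C for p,
    then invert for R."""
    p = (1 - confidence) ** (1 / n)
    return (p - (1 - sensitivity)) / (specificity - (1 - sensitivity))

perfect = demonstrated_reliability(59, 0.95, 1.0)    # ~0.95 with a perfect test
imperfect = demonstrated_reliability(59, 0.95, 0.9)  # lower with 90% sensitivity
print(round(perfect, 4), round(imperfect, 4))
```

With only 90% sensitivity, the 59-sample "95%/95%" plan demonstrates roughly 94.5% reliability rather than 95%, which is exactly the gap this post is asking about.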
 

Bev D

Heretical Statistician
Staff member
Super Moderator
#16
David316: your confusion is well founded. Risk is a terrible word, as too few people really understand it.
Risk, as it is used in the quality industry, is a vector: severity and occurrence are legs of that vector. (Detection is mistakenly assigned as a third leg in the old-fashioned, incorrect approach to risk assessment; it's actually a mitigation of occurrence and is not independent of occurrence.)

Assigning AQL/RQL levels or reliability/confidence levels by SEVERITY is correct: the higher the severity, the more sure you want to be that your mitigation(s) are working. AQL/RQL and "reliability" speak to the allowable defect rates; the higher the severity, the lower you want the defect rate to be.

Sample size is a function of the defect rate you are trying to catch, or determine: the lower the defect rate, the higher the sample size must be.

The weakness with using AQL/RQL and reliability/confidence is that they are intended for acceptance sampling, not validation. They are limited in their usefulness because they only give a maximum limit, and they are very underpowered both practically and statistically. Using a sample size formula that is intended to help you estimate the actual rate is much more useful.
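As one illustration of an estimation-oriented alternative (this is the standard normal-approximation formula, not necessarily what the poster had in mind), the sample size needed to estimate a defect rate to within a chosen margin can be sketched as:

```python
import math

Z_95 = 1.959964  # two-sided 95% normal quantile (hard-coded to avoid scipy)

def n_to_estimate_rate(p_expected: float, margin: float, z: float = Z_95) -> int:
    """Sample size so that an estimate of a defect rate near p_expected
    has an approximate 95% confidence interval of +/- margin, using the
    normal approximation: n = z^2 * p * (1 - p) / margin^2."""
    return math.ceil(z**2 * p_expected * (1 - p_expected) / margin**2)

# e.g. to estimate a ~5% defect rate to within +/- 2 percentage points:
print(n_to_estimate_rate(0.05, 0.02))  # -> 457
```

Unlike a pass/fail success-run plan, this gives you an actual estimate of the rate with a stated precision, which is the distinction being drawn above.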

Here is where your confusion is correct: pre-mitigation, the defect rate on your high-severity failure mode is probably fairly high. Since design is supposed to take mitigation action for these known high-severity failure modes, you probably wouldn't need to test at all - just put the mitigation in place. Then you probably should test whether the mitigation is acceptable. The failure rate should be low and the severity hasn't changed, so the sample size should be large. But now you are in design verification/validation, so the variations introduced by manufacturing don't come into play. Also, you could perform testing under worst-case conditions: without being foolish, use the most water for the longest duration under the highest pressures that might be expected. This allows you to lower your sample size to just a handful of parts. You could at this point test the minimum, nominal and maximum design tolerances, and this would reduce your sample size to 3, as worst-case testing is deterministic, not probabilistic. (Minimum and maximum tolerance are in terms of the water ingress ability, not the actual physical dimensions: a maximum dimension that creates a really tight fit would be a 'minimum' probability of water ingress, while a smaller dimension that creates a somewhat looser fit would be a maximum water ingress condition.)

You see, sample size is not just a statistical question but a physics question; the two are not independent. If you don't take physics into account, statistics is just gambling...
 

Bev D

Heretical Statistician
Staff member
Super Moderator
#17
Taking into account we often take 95% confidence to be an 'industry expectation'
Ooh, you've just hit the proverbial nail on the proverbial head. The 95% confidence is a MYTH. It is actually based on a misunderstanding of what Sir Ronald Fisher said when asked what an acceptable "alpha risk" should be: he responded that 5% was reasonable for multiple replicates and only moderately severe problems (he was in agriculture, for Pete's sake).

The 5%/95% is only used because it's what we've always done, and too many people don't have the mental fortitude to think about these things critically. That's why auditors and even statisticians reflexively fall back on it: it's easy and they don't have to think. It's usually completely wrong when we don't have replication and we don't take into account that most processes are non-homogeneous.
 

Statistical Steven

Statistician
Staff member
Super Moderator
#18
Hi all, long-time reader, first-time poster. I am a mechanical engineer, frequently involved in "bench testing" of medical devices.

I have trawled through the forums many times over the last few years on sample size selection and never really got much wiser. I've asked as many colleagues as I can in my 7 years in the medical industry, but have never been given any good reference for developing a 'valid statistical rationale' for design verification. I have seen many good references for process capability and batch acceptance, but never for design verification. To date I have been either using a standard which dictates sample size (e.g. sterilization validation and biocompatibility) or somewhat arbitrarily picking a number that I think will represent the nominal design. Some of my testing has been submitted in 510(k)s or in EU submissions for Class II devices. My sample size rationale has never been questioned, but more and more I hear people talking about a valid statistical rationale.

So here is my current case:

I have a disposable medical inhaler device - for the sake of argument let's say it's a gas mask which filters incoming air with activated carbon and is used for only 20 minutes before discarding - which I need to test for airway particulate. The device is already manufactured and sold. Batch sizes are between 25k and 100k. ISO 18562-2 defines a limit for airway particulate emitted by the device: 12 µg/m³ (12 micrograms of particulate per cubic meter of air inhaled through the device). The standard says the recommended test methods are "type tests", which to me means 1 sample. However, given this is a disposable device and the particulate is likely to vary - it contains activated carbon granules which are assembled into the device manually - I am concerned that 1 sample would not represent the nominal design. What rationale can I use in selecting a sample size for this application?

I have done some preliminary tests which indicate the average is 4 µg/m³ across 10 devices. Without getting into details, that test took 10 hours and does not allow me to measure the particle density for individual devices, only the average across all of them.

Note that the above standard has only recently become recognized, which is why I am testing after the product is released (nothing dodgy here!)

TIA!
If a standard exists, then following it is considered "statistically valid", even with a sample size of 1. If the standard is not clear, or does not explicitly specify the sample size, you need to justify it based on your risk management process (following ISO 14971). Many organizations put in place sampling plans for design V&V that are based on risk management and FMEA, so all devices are sampled in a similar fashion for similar risk.

If the test for particles yields a continuous variable, a recommended strategy is to test 30 units (this is arbitrary but sufficient to estimate the mean and standard deviation). Compute a 95% one-sided prediction interval; if the upper limit of the prediction interval is less than the specification in the standard, you pass design V&V. If you have sufficient design data, you can use an estimate of the mean and standard deviation to calculate the appropriate sample size (instead of 30) needed for the upper 95% prediction bound to fall below the limit in the standard.
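A minimal Python sketch of that prediction-interval check, using made-up per-device particulate readings (the real test above only produced an average) and a hard-coded t quantile; a real analysis would use actual data and the exact t value for its degrees of freedom:

```python
import math
import statistics

def upper_prediction_bound(data, t_crit):
    """One-sided upper prediction bound for a single future unit:
    mean + t * s * sqrt(1 + 1/n). t_crit must match df = n - 1."""
    n = len(data)
    m = statistics.mean(data)
    s = statistics.stdev(data)  # sample standard deviation
    return m + t_crit * s * math.sqrt(1 + 1 / n)

# Hypothetical particulate readings (µg/m³) for 10 devices, centered
# near the reported 4 µg/m³ average; values are illustrative only.
readings = [3.1, 4.4, 3.8, 4.9, 3.5, 4.2, 4.6, 3.9, 4.1, 3.7]
T_95_DF9 = 1.833  # one-sided 95% t quantile for 9 degrees of freedom

bound = upper_prediction_bound(readings, T_95_DF9)
print(round(bound, 2), bound < 12.0)  # V&V passes if below the 12 µg/m³ limit
```

With tight data like this the upper bound lands around 5 µg/m³, comfortably below the 12 µg/m³ limit; the same check with noisier data is what would drive the sample size up.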
 

david316

Involved In Discussions
#19
(Quoting Bev D's reply in #16 above.)
Thanks Bev. I have read a few articles around this, and I think my confusion comes from the poor use of the word risk. As you alluded to, severity is the more appropriate term, but articles on the net often use "risk" when talking about severity, which is very confusing!

E.g. Sample Sizes: How Many Do I Need? | 2014-07-07 | Quality Magazine
 
