Sample Size for Distribution Simulation Testing

PkgTest

Registered
Hi,

I am new to this board and have been reading many of the posts on sample size. What great information. I have a question about determining sample size for distribution simulation testing.

I work for a consumer products company that produces a wide range of products. We ship door to door via the single parcel distribution channel (UPS, FedEx, etc.). All of a customer's products are shipped together in one box based on what they order. We conduct distribution simulation testing in our lab using vibration and drop equipment. We have a set mix of mock products that we put into the box along with the sample we are testing; the mock product mix is based on average customer orders taken over several months. We also use a 0-3 scale (no damage, minor, moderate, and major) on which we classify each test sample. Zero to 1 is a pass, 2 is borderline, and 3 is a fail. The samples we receive for testing are collected during manufacturing line trials. They typically run about 2000 samples during these line trials, and samples are submitted to various groups for testing, ours being one. Right now, we request 59 samples based on attribute data sampling (n=59 for 95/95) and we say that if we see zero failures after testing, we have an upper limit of 5% damage. For more high-end products, we have even used a sample size of 120 to claim an upper limit of 2.5% damage. Cost of the product is not a factor.
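For reference, the n=59 and n≈120 figures follow from the standard zero-failure ("success run") confidence/reliability formula, n = ln(1 − C) / ln(R), rounded up. A minimal sketch of that calculation (the generic formula, not tied to any particular standard):

```python
import math

def c0_sample_size(confidence, reliability):
    """Zero-failure (c=0) plan: smallest n with 1 - reliability**n >= confidence."""
    return math.ceil(math.log(1 - confidence) / math.log(reliability))

print(c0_sample_size(0.95, 0.95))   # 59  -> upper limit of 5% damage at 95% confidence
print(c0_sample_size(0.95, 0.975))  # 119 -> close to the 120 used for 2.5% damage
```

The formula only applies when zero failures are observed; a single failure invalidates the claim at that sample size.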

Does this approach make sense? Can I use the 0-3 scale as variable data to lower the sample size? If I can use the scale as variable data, what formula would I use to calculate sample size? We do many tests, so reducing sample size while still keeping a good confidence in the results would be appreciated.

Thanks in advance for any information.
 

Miner

Forum Moderator
Leader
Admin
@Bev D please chime in here. Bev has discussed worst case testing schemes that minimize the need to test such large sample sizes.
 

Bev D

Heretical Statistician
Leader
Super Moderator
Give me a few hours and I’ll publish a brief description of how - and why - we do this with really small samples.
I’ll also describe why you absolutely cannot treat the 1-2-3 ordinal scale as variables data. Although @Miner will likely beat me to that ;)
 

Miner

Forum Moderator
Leader
Admin
Without going into a lot of detail, it is really a matter of how much information is contained in the data. Pass/Fail (binomial) data has the least information content: there are only two possible states, with counts for each. Ordinal data such as your 0-3 scale provides a little more information because it adds more states (4 for your scale) plus some directionality (severity, in your case). However, ordinal data cannot provide the amount or the quality of information provided by continuous data, which has an infinite number of states. Ordinal data cannot tell you how much more severe a 1 is than a 0, or a 3 than a 2. Continuous data can: 2 kg has exactly twice the mass of 1 kg, and 10 kg has exactly twice the mass of 5 kg. In addition, ordinal scales are often nonlinear. Take the ordinal classifications of hot peppers vs. the Scoville (SHU) measurement.
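To make that nonlinearity concrete, here is a small sketch using rough ballpark SHU values (illustrative figures only, not authoritative measurements):

```python
# Ordinal heat rank vs. approximate Scoville Heat Units
# (SHU values are rough ballpark figures chosen for illustration)
peppers = [
    ("bell pepper",     0,         0),
    ("jalapeno",        1,     5_000),
    ("habanero",        2,   225_000),
    ("carolina reaper", 3, 1_600_000),
]

# Each ordinal step is "+1", but the underlying SHU jumps are wildly unequal:
prev_shu = 0
for name, rank, shu in peppers:
    print(f"rank {rank}: {name:15s} ~{shu:>9,} SHU (step: +{shu - prev_shu:,})")
    prev_shu = shu
```

Averaging the ranks 0-3 would implicitly treat each of those steps as equal, which is exactly why an ordinal damage scale cannot be fed into variables-data sample size formulas.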
 

Attachments

  • Sample Size for Distribution Simulation Testing
    1615563042784.png
    53.2 KB · Views: 35

Bev D

Heretical Statistician
Leader
Super Moderator
Sorry, I don’t yet have the more detailed write-up (with pictures); I can probably post it on Tuesday. Until then: the idea is to test at the extremes. By extremes I mean at the specification limits, which are supposed to be set to guarantee no failures. The limits are supposed to be determined through experimentation. Of course this doesn’t always happen, but stuff in the middle of the specifications will certainly likely work.

The idea of the confidence/reliability (or any other) sampling plan for validation (as opposed to lot-to-lot acceptance testing) is to randomly select samples from across the specification range so that you get some good and some ‘bad’. IF the sample is truly random from a population that is in fact representative of the distribution of parts that will be produced in the future across the specification range, THEN the testing will give you a “pass/fail” idea that the future population will have no more than the established defect rate. The defect rate may be less than that; this type of sampling is not intended to estimate the likely defect rate. And given that it is a single sample, there is some probability that the real defect rate is higher than your stated reliability.

The larger failure with these types of sampling plans is that you actually don’t have a truly representative sample, because you didn’t sample from a representative population - that would require you to build parts at the limits of the specification, and in validation people rarely do that - they produce at the target (or the starting position, in the case of mold wear).

For situations where the failure may depend on an interaction (tolerance stack-ups or conditions for failure), you have a distinct advantage to leverage in addition to using only parts at the extremes. Many failures are a result of the stress-strength interaction: if the stress is strong enough and the part is weak enough, you get failure. Under the same high stress, a strong enough part won’t fail. Packaging testing is exactly this situation; it doesn’t take much imagination. So if you compile the ‘weakest’ packaging allowed by your specifications, the worst case packing scenario AND the worst case stresses (nothing foolish - the stress has to actually exist in real life and be one that you guarantee survival of), then you only need to test ONE sample. If it passes, everything stronger will pass under weaker stresses. If it fails, you know you have to improve something on your end.
(It’s like the ‘mouse in the house’ problem: do you really need to establish exactly how many mice are in your house before you begin eradication procedures? Don’t you KNOW that if you saw one, there are many more? And they didn’t ring the front door bell to get in...)
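The stress-strength idea can be sketched numerically. Assuming, purely for illustration, normally distributed stress and strength (the numbers below are made up), the field failure probability is P(stress > strength):

```python
from statistics import NormalDist

# Hypothetical distributions - illustrative numbers, not real packaging data
stress   = NormalDist(mu=50, sigma=5)   # e.g., shipping drop severity
strength = NormalDist(mu=80, sigma=6)   # e.g., package robustness

# stress - strength is normal with mean 50 - 80 and variance 5**2 + 6**2
diff = NormalDist(mu=stress.mean - strength.mean,
                  sigma=(stress.stdev**2 + strength.stdev**2) ** 0.5)
p_fail = 1 - diff.cdf(0)   # P(stress - strength > 0), i.e., P(failure)
print(f"P(failure) ~= {p_fail:.2e}")

# Worst-case logic: if the weakest part allowed by the spec survives the
# worst-case stress, every stronger part survives too - no random sample needed.
```

This is why one deliberately chosen worst-case sample can stand in for a large random one: the single test point bounds the whole overlap region.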

So when you test at the extremes only, you don’t need a statistical sampling plan. You are not trying to overcome the unknown distribution of parts and the unknown distribution of conditions in the field by random sampling - you have deliberately selected the worst case parts and are testing at the worst case conditions. If you pass there, you pass at the best case conditions...
 

PkgTest

Registered
Thanks Bev D. Regarding testing at the extremes, by your definition (“the specification limits which are supposed to be set to guarantee no failures”), I think that may be where our testing is not right. The distribution test we run in our lab is based on vibration and drop field data we recorded over hundreds of actual truck routes in our biggest market. We put recorders in our boxes and sent them out over several months, then set our test at the 95th percentile of the vibration and drops we recorded. So basically, we are testing near the worst extremes we saw in the field. This pretty much guarantees we see failures during our test instead of guaranteeing no failures. On the occasions where we see no failures, we feel really confident that our product will survive the shipment to our customer. On the other end of the spectrum, we typically see a good percentage of failures during our test. So the question then becomes: now what? Will we actually see this many failures in the field? Well, I guess, if we are shipping on our worst routes. But those routes comprise only a small percentage of our shipments (according to our field recorder data, only the top 5%). Should we set our test at the 50th percentile of the data we collected, since you said the middle of the specification will likely work?

I have read many of your posts on sample size and have learned so much. I really look forward to your response.

Thanks.
 

outdoorsNW

Quite Involved in Discussions
If I understand correctly, the data was collected from a sample of shipping on typical routes with typical products. If you test at the 95th percentile level, roughly 5% of shipments will experience conditions harsher than your test, so some field failures are likely. These may or may not be acceptable; if your goal is high customer satisfaction, they probably are not. If you test at the 50th percentile, then about half of your shipments will see conditions harsher than your test, and damage would be far more common.

If your test focused on the most damage-prone products, then field failure rates will be better than your test results, because most products are not as likely as the test products to be damaged. If you knew which shipping routes and methods create the greatest risk of shipping damage, then the real-world numbers would also be better. But it is not clear to me that your testing to date did anything to focus on worst-case conditions.
 

PkgTest

Registered
We sent out multiple field recorders multiple times on more than 100 different routes in our biggest market. Say we sent them on exactly 100 routes: we selected the data from the route at the 95th percentile for vibration and drops, and we use that data to run our test. So only 5 routes experienced worse vibration and drops than what we test. Currently, we use the attribute sample size of 59 (c=0, 95/95). If all samples pass, we are confident that our field damage will be low (though another question is whether we are overpackaging). However, if samples fail our 95th-percentile test, what percentage of failures would we have seen at the 90th, 75th or 50th percentile, to put the damage in perspective? Is testing at the 95th-percentile vibration and drops the proper way to go about this, and can a sample size less than 59 be used? Thanks.
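One way to put those severity levels side by side is to compute the percentiles directly from the recorder data. A sketch with simulated drop heights standing in for the real recordings (the distribution below is made up):

```python
import random
import statistics

random.seed(42)
# Stand-in for field-recorder drop heights in cm; real recordings would go here
drops = [max(0.0, random.gauss(60, 20)) for _ in range(500)]

cuts = statistics.quantiles(drops, n=100)  # 99 percentile cut points
for p in (50, 75, 90, 95):
    print(f"{p}th percentile: {cuts[p - 1]:.1f} cm")
```

Running the lab test at the 95th-percentile severity while knowing what the 90th, 75th and 50th levels look like gives the margin picture: how far above typical conditions the test sits.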
 

Bev D

Heretical Statistician
Leader
Super Moderator
I have attached a brief overview of the probabilistic approach to V&V testing along with a critique of the weaknesses of using acceptance sampling plans for V&V testing.

I would also like to comment on two statements that PkgTest has made. Apologies if I didn't interpret their comments correctly.
"If samples fail our 95th percentile test, what percentage of failures would we have seen at the 90th, 75th or 50th percentile to put the damage in perspective?" This is a valid question, of course, if you are trying to confirm a failure rate greater than ~5% at your worst case packing. This type of margin assessment should be completed prior to validation testing; it is actually a development activity. One should never go into validation without knowing what you will discover. Remember that validation is supposed to be a confirmation that you got your design right to meet requirements (although I realize that this is too often not the case). If you fail at worst case stress, then you should correct your design to meet requirements. The goal is not to 'pass' validation; it is to have very few damaged packages.
"Is testing at the 95th percentile vibration and drops the proper way to go about this and can a sample size less than 59 be used?" I typically use 1-3 units (based on cost and the organization's nervousness about only using one part for V&V testing) at the worst case condition. If I have no failures at worst case conditions with worst case parts, I will have no failures in the field.

Now of course in my organization, we test many characteristics across many vintages of products so the total sample size used reflects that.

Also, we must acknowledge that some failures will still occur. The use conditions can change drastically - a truck gets squashed by a giant boulder in a mudslide (it happened!) or something equally catastrophic in the realm of force majeure. For these situations we have recovery and compensation plans in place. But these are truly rare and should never be a reason NOT to improve the quality of your product. There are also some characteristics that we are unaware of in early product manufacture, and they can rise up to get us. In these cases we institute corrective actions and IMPROVE OUR KNOWLEDGE of the system.
 

Attachments

  • Sampling for Validation Testing.docx
    28.8 KB · Views: 341