
View Full Version : Normal distribution - I have a series of data but I don’t know the distribution


smedboy
9th July 2009, 10:41 AM
Hello everybody.
I have a question.
When I run a single experiment, I obtain a series of numerical data.
Then I calculate the mean and the standard deviation. Does it make sense to calculate +/- 3 sigma?
Let me explain. I have my series of data, but I don’t know what distribution the data follows, and +/- 3 sigma applies only to a normal distribution.
I know that means usually follow a normal distribution, so I have always divided my series into smaller groups and then calculated 6 sigma on the means of these groups.
Now I’m starting to doubt that this is a correct process.
My questions are:
- How do I establish the type of distribution, or how can I approximate it with a normal distribution?
- Can I form my groups randomly, or should I use some parameters to group them?
Any other explanation you can give is welcome.
Thanks in advance.

Jennifer Kirley
9th July 2009, 10:53 AM
Welcome to the Cove!

We don't always use normal distribution. What are you testing for - what is the experiment intended to tell you? There are different kinds of tests. See this page on statistical analysis (http://www.quantitativeskills.com/sisa/statistics/t-thlp.htm) to understand what I mean.

smedboy
9th July 2009, 02:14 PM
For example:
I make 50 measurements of the extraction force (N) necessary to pull my plastic cylinder out of the seat in which it was planted.
How should I analyse this data? How do I establish the distribution?
Should I divide them into groups of 5 and calculate +/- 3 sigma on the means of the groups?
Or am I getting it all wrong?

Bev D
9th July 2009, 02:38 PM
I think the dilemma you are facing is that the structure of how you take samples and the analysis you need to perform depend on what you are trying to determine. (The distribution type and the applicability of the Normal distribution are secondary or even trivial compared with determining which experimental structure you need.) So let's set the distribution question aside for a moment.

Let's expand on your example: are you extracting the same cylinder 50 different times? or are you extracting 50 different cylinders? why are you measuring the extraction force - what are you trying to learn about extraction force: are you trying to determine how well you can measure extraction force or are you trying to determine how much variation you have in extraction force (or holding force? are you using extraction force as a surrogate for how well the cylinder is held in its seat or are you really interested in how much force it takes to extract the cylinder...)

if you can answer these questions - and post any data you may have with an explanation of how the process works and how you collected the data - we can provide you with a more useful and specific answer...

smedboy
9th July 2009, 04:44 PM
Well, I have a new production process that consists of planting plastic cylinders in one cavity of our product.
What we are trying to establish is how many N (in the failure case) are necessary to pull out the cylinder.
I have made 50 tests, as a start.
In order to guarantee good repeatability, I have used the same planting machine (we have 3) and always the same molding figure of the cylinder and of the cavity (they are small pieces, so there are several figures in the molding stamp). Every time I have used new pieces because the nature of the test is destructive.

Bev D
10th July 2009, 07:34 AM
another couple of clarifying questions:
I am interpreting what you are saying as that each time you extract the cylinder you record the force? and you are extracting 50 different cylinders from the same part/cavity because you destroy the cylinder when you extract it? if so then you aren't interested in the "N" (number of extractions), you are interested in the force required to extract???


if I have the interpretation above correct, then I am assuming that you are really interested in the force required to extract the cylinder. this is most likely a surrogate test for your true interest in the 'holding' force of the cylinder????

If the above is correct could you tell us what conditions and forces the part/cylinder will be subjected to in use? In other words why are you concerned about the holding force? and why are you concerned about repeatability? repeatability of what?

I ask these questions because we must first understand the physics of the situation BEFORE we try to apply the correct statistics.

that said and with the limited understanding I have of your situation I would recommend two things:

1) Don't worry about 'repeatability' (that is typically a false issue in these cases) and select 30 RANDOM assemblies and extract the cylinders and record the force. You really need to understand variation of your process first and extracting 50 different cylinders from a single part will not provide you that info.
2) post your data and we can take a look at it to determine what distribution - if any - applies and how we might use the data to predict the capability of your assembly to hold the cylinder given its use conditions



please note that random means random. but it also means ensuring that your samples have as much opportunity for variation as possible. this means different lots of subassemblies, materials or assembly lots depending on how the process works - if at all possible.

smedboy
10th July 2009, 08:57 AM
1) With N I was indicating newtons, the unit of the force we use to pull out the cylinder.
2) I was using a new cylinder and a new cavity every time because the plastic deforms, losing its geometrical properties.
3) I was concerned about repeatability because I have 3 cylinders planted at the same time with 3 different planting machines (that theoretically are the same, but in reality there is 0.05 mm of difference) and because, as you know, in plastic molding, products from the same stamp but different cavities are never completely identical (up to 0.03 mm of difference). That is the reason why in previous tests, in which we did not group the pieces, we had a big standard deviation. By grouping we have now reduced the SD by half.

Bev D
10th July 2009, 01:54 PM
OK I get points 1 and 2 now. makes sense. but 3 doesn't: I understand that testing only 1 of the 3 cavities will result in less variation in your data set. BUT aren't you worried about the total actual performance of your product in the field? In that case won't all 3 cavities and their total be involved? Maybe I'm missing something, but if the variation is too large shouldn't you concentrate on physically reducing the variation in all 3 cavities? what benefit do you gain by only analyzing one cavity?

that said: if you post your data we can help with the original question regarding the statistics. It is impossible to provide a specific response without the data. The only thing we could do at this point is give you a long laundry list of possible approaches, cautions and admonitions....

bobdoering
10th July 2009, 02:07 PM
1) With N I was indicating Newtons that is the force we use to pull out the cylinder.
2) I was using a new cylinder and a new cavity every time because the plastic deforms, losing its geometrical properties.


From my pull testing experiences, these are very valid points. You are truly dealing with destructive testing.

3) I was concerned about repeatability because I have 3 cylinders planted at the same time with 3 different planting machines (that theoretically are the same, but in reality there is 0.05 mm of difference) and because, as you know, in plastic molding, products from the same stamp but different cavities are never completely identical (up to 0.03 mm of difference). That is the reason why in previous tests, in which we did not group the pieces, we had a big standard deviation. By grouping we have now reduced the SD by half.

This is good information to know: do the 3 machines have significantly different influences on the planting results? They (including their tooling) may, and that is handy info for reduction of variation. But, conversely, as Bev states, the total variation is going to represent the field response.
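
To make the distinction between machine-to-machine variation and total variation concrete, here is a minimal Python sketch. The forces are synthetic: the three machine means and the within-machine SD are invented for illustration and are not the data posted in this thread. It splits the variance of the pooled data into a within-machine part and a between-machine part.

```python
import random
import statistics

random.seed(1)

# Synthetic extraction forces in newtons, NOT the poster's data.
# The machine means and the common within-machine SD are invented.
machine_means = {"machine_1": 260.0, "machine_2": 300.0, "machine_3": 340.0}
within_sd = 25.0
forces = {
    name: [random.gauss(mu, within_sd) for _ in range(50)]
    for name, mu in machine_means.items()
}

pooled = [x for values in forces.values() for x in values]

# With equal group sizes, population variances add exactly:
# total = average within-machine variance + variance of the machine means.
within_var = statistics.fmean([statistics.pvariance(v) for v in forces.values()])
between_var = statistics.pvariance([statistics.fmean(v) for v in forces.values()])
total_var = statistics.pvariance(pooled)

print(f"within-machine SD : {within_var ** 0.5:.1f} N")
print(f"between-machine SD: {between_var ** 0.5:.1f} N")
print(f"total SD          : {total_var ** 0.5:.1f} N")
```

Testing a single machine removes the between-machine term, which is one way the SD can drop sharply, but the field sees the total.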

smedboy
10th July 2009, 04:21 PM
Twice the same post.

bobdoering
10th July 2009, 05:41 PM
My question is: does it make sense to calculate +/- 3 sigma on 50 samples obtained like this? Isn't +/- 3 sigma calculated on the means of a few sample groups? How should I analyze this data?

I would expect the physical phenomena and the test method error to have a total variation that is normal. How much of the variation is test, and how much is physical is an interesting question.

By allowing the product to cool, it more closely correlates to actual use - but, you may also find more variation. I would not be surprised by that - but it may not matter.

So, since I would expect the data to be normal, it is reasonable that +/- 3 SD of your data should provide an estimate of the spread of the population. It does not have to be calculated on the means of the sample groups. That is really more of an issue when evaluating variation over time, which does not appear to be what you are trying to analyze right now. It can be calculated on the total 50 pcs.
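
As a rough illustration of the difference between the two calculations, the sketch below computes +/- 3 SD on 50 individual values and, for comparison, on averages of 5. The data are synthetic (a normal distribution with an invented mean and SD), and the limits for the averages use the simple SD / sqrt(n) relationship rather than the within-subgroup estimate a real X-bar chart would use.

```python
import math
import random
import statistics

random.seed(2)

# Synthetic extraction forces in newtons (invented mean and SD).
forces = [random.gauss(300.0, 40.0) for _ in range(50)]

mean = statistics.fmean(forces)
sd = statistics.stdev(forces)  # sample SD of the 50 individual values

# Expected spread of individual parts: use the SD of the individuals.
individual_limits = (mean - 3 * sd, mean + 3 * sd)

# Expected spread of averages of n parts: the SD shrinks by sqrt(n).
n = 5
sd_of_means = sd / math.sqrt(n)
mean_limits = (mean - 3 * sd_of_means, mean + 3 * sd_of_means)

print(f"individuals: {individual_limits[0]:.1f} to {individual_limits[1]:.1f} N")
print(f"means of {n}: {mean_limits[0]:.1f} to {mean_limits[1]:.1f} N")
```

The limits on the averages are narrower by a factor of sqrt(5), so they describe where subgroup averages fall, not where individual parts fall.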

smedboy
11th July 2009, 04:40 AM
OK, you are right. Eventually, as planned, we will make other tests combining all stamp figures of the cylinder with all cavities and all planting machines.
I may as well explain the whole process then. I’ll try to be brief:
In this process we take the 3 cylinders, heat the cavity and then plant the cylinders. The cavity shrinks a little, the plastic bonds because of the heat, and the trick is done.
This process is an evolution of the previous one, in which the cylinder extraction force that we are measuring now was around 100 N. Now we have a minimum of 190 N and a maximum of 400 N, with all intermediate values randomly distributed in between and a big std. dev.
Now we have to implement a test station in the production line that will run pull tests on randomly sampled pieces. Because of the heating process, we have to establish how much time should pass before the test gives a realistic response. We know for sure that after 10 min we are in the ideal condition with maximum extraction force. So we decided to make tests at different moments, after 10, 20, 30 sec and so on, and compare them with the ideal case. But because of the high std. dev. we weren’t able to see clearly the difference between the ideal case after 10 min and the cases after 10, 20, 30 sec.
So we made homogeneous groups with 50 cylinders of the same figure, in the same type of cavity, with the same planting machines. In this way we turned some of the variables into constants, reducing the std. dev. by 50 %. It was still high, but better. The results were much more homogeneous in both cases (after 10 min and after 10, 20, 30 sec).
But my problem is more a conceptual one than a physical one. I know that these samples do not have a normal distribution.
And in this case I think it makes no sense to calculate +/- 3 sigma on the mean of a single population. I think it makes more sense to calculate +/- 3 sigma on the mean of means of several sampling groups, right? In this way it indicates where my process is “centered”? And generally speaking, how should I behave with non-normal distribution cases?
Any other advice is welcome!
Last question: can you tell me where I can find some tutorials or real-life examples (like Minitab Quality Trainer) on SPC? I know the theory, but I have trouble putting it into practice, and things like ANOVA and similar could really help me in some cases.
Thanks.

Bev D
11th July 2009, 09:11 AM
But my problem is more a conceptual one than a physical one. I know that these samples do not have a normal distribution.
And in this case I think it makes no sense to calculate +/- 3 sigma on the mean of a single population. I think it makes more sense to calculate +/- 3 sigma on the mean of means of several sampling groups, right? In this way it indicates where my process is “centered”? And generally speaking, how should I behave with non-normal distribution cases?

OK - it's good that you know you don't have a Normal distribution. but again unless you post the data we can't help you with this real life example.

If you don't have a Normal distribution, it doesn't make sense to calculate +/- 3 SD limits at all.

On the other hand, sample averages MAY have Normal distribution or a Student's t distribution if your sample size is relatively small. BUT the bigger question is: why do you care about sample averages?
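
The "MAY" above can be checked with a small simulation. This is only a sketch on synthetic data: the right-skewed shifted-lognormal shape and its parameters are assumptions chosen for illustration, not a model of the real process. It compares the skewness of individual values with the skewness of averages of 5.

```python
import random
import statistics

random.seed(3)

def skewness(data):
    """Population skewness: 0 for a symmetric distribution."""
    m = statistics.fmean(data)
    s = statistics.pstdev(data)
    return statistics.fmean([((x - m) / s) ** 3 for x in data])

# Synthetic right-skewed "forces" (shifted lognormal); parameters invented.
individuals = [200.0 + 50.0 * random.lognormvariate(0.0, 0.6) for _ in range(5000)]

# Consecutive subgroups of 5 values, then the average of each subgroup.
n = 5
subgroup_means = [
    statistics.fmean(individuals[i:i + n]) for i in range(0, len(individuals), n)
]

skew_individuals = skewness(individuals)
skew_means = skewness(subgroup_means)

print(f"skewness of individuals: {skew_individuals:.2f}")
print(f"skewness of means of {n} : {skew_means:.2f}")
```

The averages are less skewed than the individual values, but with subgroups this small they are usually still not perfectly symmetric, which is why the averages only "may" be treated as Normal.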

bobdoering
11th July 2009, 10:26 AM
But my problem is more a conceptual one than a physical one. I know that these samples do not have a normal distribution.

And in this case I think it makes no sense to calculate +/- 3 sigma on the mean of a single population. I think it makes more sense to calculate +/- 3 sigma on the mean of means of several sampling groups, right? In this way it indicates where my process is “centered”?

The data from the individual machines should be normal, in my mind, based on the phenomena you describe. But, at this point, I agree with Bev. I would love to see some data - but from the individual machines. The combined data from the three machines will be non-normal, as it will likely be multimodal.

For control, you would analyze each machine individually, however. Control is all you have, because nature will take over after that. So, if you are concerned about centering you need to look at each machine individually. You also need to understand what are the variables you can "adjust" to become centered.

For Field failure prediction, however, you will need the combined variation of all machines.

I think you are starting to mix these two things up.

And generally speaking how should I behave with non normal distribution cases?

The best answer is "it depends." If it is non-normal, you need to understand what distribution the process is, for starters. You should also have a good idea why it is non-normal. If it is multimodal, you need to understand why. Remember, when analyzing data variation you are analyzing total variation - the sum of all variation. If you have several significant variations, it will be multimodal - although each can be normal, or they can be combinations of normal and non-normal. That is why you need to reduce many of the variations to get to the process variation. That is why it is so important to make measurement variation (error) statistically insignificant. It may mask true process variation.

Some non-normal distributions can be controlled in a similar way to normal, especially if the distribution is a mean-centered one. But highly skewed or uniform distributions need different control mechanisms.

Many processes are non-normal. To find the distribution, I use Distribution Analyzer from www.variation.com. It is simple to use. If it is appropriate, you can analyze the data with transformations.
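
As one possible example of analyzing with a transformation, the sketch below applies a log transform, computes +/- 3 SD on the log scale, and transforms the limits back to newtons. This is not how Distribution Analyzer works internally (that is not described here). The data are synthetic lognormal values with invented parameters, and a log transform is only appropriate when the logged data actually look normal.

```python
import math
import random
import statistics

random.seed(4)

# Synthetic right-skewed forces in newtons (lognormal, invented parameters).
forces = [random.lognormvariate(math.log(300.0), 0.25) for _ in range(50)]

# Naive limits computed directly on the raw, skewed values.
mean = statistics.fmean(forces)
sd = statistics.stdev(forces)
naive_limits = (mean - 3 * sd, mean + 3 * sd)

# Limits computed on the log scale, then transformed back to newtons.
logs = [math.log(x) for x in forces]
log_mean = statistics.fmean(logs)
log_sd = statistics.stdev(logs)
lower = math.exp(log_mean - 3 * log_sd)
upper = math.exp(log_mean + 3 * log_sd)
center = math.exp(log_mean)  # geometric mean of the data

print(f"naive limits       : {naive_limits[0]:.1f} to {naive_limits[1]:.1f} N")
print(f"transformed limits : {lower:.1f} to {upper:.1f} N (center {center:.1f} N)")
```

The back-transformed limits are asymmetric around the center and can never go below zero, which matches a right-skewed force better than symmetric limits do.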

smedboy
11th July 2009, 02:01 PM
I'll post some data first thing Monday. Have a good weekend all! Can you tell me where I can find some tutorials or real-life examples (like Minitab Quality Trainer) on SPC? I know the theory, but I have trouble putting it into practice, and things like ANOVA and similar could really help me in some cases.

Miner
11th July 2009, 04:29 PM
The data from the individual machines should be normal, in my mind, based on the phenomena you describe.
Not necessarily. I have worked with physical test results several times in the past. Many of them tend to be right skewed because you will get a smaller number of units with really high results.

bobdoering
11th July 2009, 11:48 PM
Not necessarily. I have worked with physical test results several times in the past. Many of them tend to be right skewed because you will get a smaller number of units with really high results.

As have I, for years when I was in R&D. And it is true, it is not necessarily normal, but I would expect it to be from the description.

You reminded me of a good point. Tensile testing (unlike length measurement) can be some of the most dubious testing, fraught with error or competing phenomena that contribute to the final failure force. Often times it was necessary to look at the nature of the curve, such as a slope transition (yield point) rather than the terminal failure data result, in order to get repeatable results. Skewed data would make me suspicious right off the bat, and I would go to the pull curve next to get an idea of how repeatable the actual test is. That is not to say it couldn't be skewed. If it is close to a physical barrier - such as zero, then it would be common. But it would be suspicious, at best, between 190 N and 400 N.

Trust me, I rarely expect normal distribution, but if it is to occur natural phenomena tend to be the most likely source - and I think that is the net result of what we have here.

smedboy
13th July 2009, 07:42 AM
Here is some data!
What about those links with tutorials or examples?

bobdoering
13th July 2009, 03:11 PM
I have attached the 'best fit distribution' data from Distribution Analyzer for review. The distributions are (from top to bottom) in the same order as the data (from left to right). Just a quick starting place...

You can see the probabilities are a bit different than assuming normal distribution (+/- 3 sigma).

smedboy
13th July 2009, 05:50 PM
What I didn't write is that the 3 tables on the sheet refer to 3 different planting machines, with random cylinders and random cavities. Then we did the second part using homogeneous groups.
I ran some Minitab analyses.

bobdoering
13th July 2009, 11:01 PM
Just to help visualize the comparison of the 'best fit' curves and the 'normal' curve, I have added the normal curves to the data sheet. Look at the p values - the non-normal curves are a much better fit. If you combine 3 non-normal streams, a normal overall stream is unlikely - especially the more the means diverge. The curves are somewhat mean-centered, and more data may normalize the population - hard to say. But with the data provided you can see the error in assuming normality - and that error can affect the validity of your decisions. The probabilities shown let you see how much a poorly fitted distribution over- or under-estimates the probability.

Looking at the data for the two time period samples on the second set of data, the standard deviations indicate that the data is not significantly different.

smedboy
14th July 2009, 08:44 AM
Have you looked at sheets number 2 and 3?

bobdoering
14th July 2009, 09:21 AM
Have you looked at sheets number 2 and 3?

As far as Sheet 1, if you are looking for the range of probable variation, I would take the highest and lowest value from the 3 distributions - 153 to 647 (which happens to be fully contributed by machine 3, the worst machine).

On sheet 2, looking at the data for the two time period samples on the second set of data, the standard deviations indicate that the data is not significantly different. The value of the t-test supports that, at .004.

I did not see anything on sheet 3.

I get a feeling that more data is needed to solidify the distribution.

smedboy
19th July 2009, 05:15 PM
Thanks. I've been reading some stuff and with your help I know how to handle data now... more or less. Thanks again.

Darius
23rd July 2009, 10:28 AM
IMHO the small number of data points could make a Gaussian distribution look "non normal" (even trying to determine the distribution with a histogram could be tricky or misleading).
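
A quick simulation can show how often this happens. The sketch below draws many samples from a truly normal distribution and counts how many of them show a sizable sample skewness. Everything here is an assumption for illustration: the mean and SD are invented, and the 0.5 threshold for "looks skewed" is arbitrary.

```python
import random

random.seed(5)

def skewness(data):
    """Population skewness: close to 0 for a large normal sample."""
    n = len(data)
    m = sum(data) / n
    var = sum((x - m) ** 2 for x in data) / n
    return sum((x - m) ** 3 for x in data) / n / var ** 1.5

def fraction_looking_skewed(sample_size, trials=1000, threshold=0.5):
    """Share of truly normal samples whose |skewness| exceeds the threshold."""
    hits = 0
    for _ in range(trials):
        sample = [random.gauss(300.0, 40.0) for _ in range(sample_size)]
        if abs(skewness(sample)) > threshold:
            hits += 1
    return hits / trials

frac_small = fraction_looking_skewed(15)
frac_large = fraction_looking_skewed(200)

print(f"n = 15 : {frac_small:.1%} of normal samples look skewed")
print(f"n = 200: {frac_large:.1%} of normal samples look skewed")
```

With only a handful of points, a noticeable share of perfectly normal samples look skewed, so a best-fit result on a small sample should be read with caution.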

I looked at the data and set #2 looks like it had a change in the process (a trend from the first 13 data points); that is something I would worry about.