Use of Non-Parametric Statistics and Non-Normal Data

My approach to Non-Parametric Statistics is:

  • I don't work with statistics enough for it to matter

    Votes: 2 13.3%
  • I always assume normality

    Votes: 1 6.7%
  • I probably ought to brush up on it

    Votes: 2 13.3%
  • If the data merits, I'll use it

    Votes: 10 66.7%

  • Total voters: 15

BradM

Leader
Admin
I have seen this topic pop up several times regarding normality of data. I thought I would bounce it off of everybody to see what your thoughts are.

Parametric statistics (say the t-test) are making inferences regarding the mean. If the population is normal, then the mean and median are the same. If a set of data is non-normal, then the mean and median are not the same. A non-parametric test will make inferences based on the median.

If a set of data is ‘reasonably’ normal (through graphing; although no data is truly normal), then yes, parametric statistics are significantly more efficient and the proper choice. But if you have a small sample size, then normality cannot be assumed.

Yes, the Central Limit Theorem (C.L.T.) does come into play. But the theorem deals with averages. If you have 17 samples, it is difficult in my mind (unless the data shows normality) to assume normality, simply based on the C.L.T. Sure, I can average 7000 times from those 17 numbers (Bootstrapping), and guess what, normality! But I still, in reality, only have my 17 numbers.
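
As an aside, here is a minimal sketch in Python (numpy only; the 17 values below are made-up placeholders, not anyone's real data) of what that bootstrapping exercise looks like - the resampled means come out looking roughly normal even though the inference still rests on only the original 17 observations:

import numpy as np

rng = np.random.default_rng(0)

# 17 made-up measurements, deliberately skewed (placeholder data)
data = rng.exponential(scale=2.0, size=17)

# Resample the 17 values with replacement 7000 times and average each resample
boot_means = np.array([
    rng.choice(data, size=data.size, replace=True).mean()
    for _ in range(7000)
])

print(f"original sample mean: {data.mean():.3f}")
print(f"bootstrap means: mean={boot_means.mean():.3f}, sd={boot_means.std(ddof=1):.3f}")
# A histogram of boot_means will look roughly bell-shaped (the C.L.T. at work),
# but there are still only 17 underlying observations.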

Parametric statistics are easier and way more efficient. But unless your data shows normality, non-parametric statistics should be utilized. NCSS shows both parametric and non-parametric tests. If the two tests give you the same results, then nothing is lost. If you get different results, a little more digging may be in order to determine which set of assumptions holds truer.
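
For example, a quick sketch of running both tests side by side in Python with scipy (the two samples here are invented purely for illustration):

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Two made-up, mildly skewed samples (stand-ins for real measurements)
group_a = rng.gamma(shape=2.0, scale=1.0, size=20)
group_b = rng.gamma(shape=2.0, scale=1.3, size=20)

# Parametric: two-sample t-test (inference on the means)
t_stat, t_p = stats.ttest_ind(group_a, group_b)

# Non-parametric: Mann-Whitney U test (rank-based comparison)
u_stat, u_p = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")

print(f"t-test:       p = {t_p:.3f}")
print(f"Mann-Whitney: p = {u_p:.3f}")
# If both lead to the same conclusion, nothing is lost; if they disagree,
# dig into the normality and equal-variance assumptions before choosing.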

How does everyone deal with non-normality? Or do you?
 

Steve Prevette

Deming Disciple
Leader
Super Moderator
Re: Use of Non-Parametric statistics

How does everyone deal with non-normality? Or do you?

I use Statistical Process Control as my primary statistical tool, and since it is non-parametric (per the Chebyshev Inequality), I am pretty well covered.

I have seen folks using sampling here for many analyses, and most just run ahead with the t-test, with some rather small data sets. I did help out one group that was having trouble using the t-test; I determined that their data fit an exponential distribution and used that to analyze their results. That got them to an answer that was within their limits, and more defensible (and actually more conservative) than the t-test.
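
A rough sketch of that kind of exponential-fit approach in Python with scipy (the data below is simulated, not the group's actual measurements, and the distribution choice would of course have to be justified by the data):

import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Simulated stand-in for the group's measurements
data = rng.exponential(scale=5.0, size=30)

# Fit an exponential distribution (location fixed at zero) and check the fit
loc, scale = stats.expon.fit(data, floc=0)
ks_stat, ks_p = stats.kstest(data, "expon", args=(loc, scale))
print(f"fitted mean (scale): {scale:.2f}, K-S p-value: {ks_p:.3f}")

# Then base the limit comparison on the fitted distribution instead of the t-test,
# e.g. an upper 95th percentile:
upper_95 = stats.expon.ppf(0.95, loc=loc, scale=scale)
print(f"95th percentile of fitted exponential: {upper_95:.2f}")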
 

Tim Folkerts

Trusted Information Resource
Brad,

You make several interesting points. My overall reaction is that, yes, you need to be careful in case the data is far from normal, but often the data is "close enough" to normal that you can get away with using tests that are based on the normal distribution.

As Steve mentioned, if it is clear that the distribution isn't normal, then it can be valuable to explore the actual distribution. As always, it is important to look at the actual plots, not just the summary numbers like mean & standard deviation.

If the population is normal, then the mean and median are the same. If a set of data is non-normal, then the mean and median are not the same.

The 1st sentence sounds good, but any symmetric distribution (and even some non-symmetric distributions) will have mean = median.

But if you have a small sample size, then normality cannot be assumed.

Well, normality can be assumed :rolleyes: - I've certainly seen enough people do it to know it is possible. :frust: :lol:


Tim F
 

Statistical Steven

Statistician
Leader
Super Moderator
I have seen this topic pop up several times regarding normality of data. I thought I would bounce it off of everybody to see what your thoughts are.

Parametric statistics (say the t-test) are making inferences regarding the mean. If the population is normal, then the mean and median are the same. If a set of data is non-normal, then the mean and median are not the same. A non-parametric test will make inferences based on the median.

If a set of data is ‘reasonably’ normal (through graphing; although no data is truly normal), then yes, parametric statistics are significantly more efficient and the proper choice. But if you have a small sample size, then normality cannot be assumed.

Yes, the Central Limit Theorem (C.L.T.) does come into play. But the theorem deals with averages. If you have 17 samples, it is difficult in my mind (unless the data shows normality) to assume normality, simply based on the C.L.T. Sure, I can average 7000 times from those 17 numbers (Bootstrapping), and guess what, normality! But I still, in reality, only have my 17 numbers.

Parametric statistics are easier and way more efficient. But unless your data shows normality, non-parametric statistics should be utilized. NCSS shows both parametric and non-parametric tests. If the two tests give you the same results, then nothing is lost. If you get different results, a little more digging may be in order to determine which set of assumptions holds truer.

How does everyone deal with non-normality? Or do you?

The question for me is not IF the data is normal, but rather do I expect the data to come from a normal distribution. There are many factors, including outliers, that can contribute to a sample being non-normal (either graphically or using a test such as Anderson-Darling). Tests such as the t-test and ANOVA are still robust to slight departures from normality, so the mean and median do not have to be exactly equal. A bigger problem than the normality assumption is when, in comparing two samples (t-test), the variances are not equal between the groups. Here is where I employ nonparametric statistics if I cannot find a variance-stabilizing transformation.
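
As a sketch, those checks are easy to run in Python with scipy (the two right-skewed samples below are made up for illustration):

import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Made-up right-skewed samples with unequal spread
group_a = rng.lognormal(mean=1.0, sigma=0.4, size=25)
group_b = rng.lognormal(mean=1.2, sigma=0.8, size=25)

# Normality check on one sample (Anderson-Darling)
ad = stats.anderson(group_a, dist="norm")
print(f"A-D statistic: {ad.statistic:.3f}, 5% critical value: {ad.critical_values[2]:.3f}")

# Equal-variance check between the groups (Levene's test)
lev_stat, lev_p = stats.levene(group_a, group_b)
print(f"Levene p-value (raw data): {lev_p:.3f}")

# A log transform often stabilizes variance for right-skewed data;
# if no transform works, fall back to a nonparametric test.
log_stat, log_p = stats.levene(np.log(group_a), np.log(group_b))
print(f"Levene p-value (log scale): {log_p:.3f}")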
 

bobdoering

Stop X-bar/R Madness!!
Trusted Information Resource
My preference is to use the statistics of the native distribution, rather than force-feed it into another distribution (e.g. normal).
 

Bev D

Heretical Statistician
Leader
Super Moderator
Hmmm. Well, I am rarely at the point of performing a hypothesis test where I don't know something about the underlying process performance. In fact, I teach that to perform a hypothesis test without that knowledge is like buying a lottery ticket. You might get lucky if you somehow stumbled onto the right answer and your test demonstrated it. You might get unlucky and not have the right answer, but your test says you do. That is either a result of alpha risk or a 'foolishly' designed experiment (biased, non-random, outside normal operating conditions, etc.).

The things I want to know first are:

What is the average?

What is the rough range of the Y, min to max? Or the occurrence rate, if it's categorical data? How else do I know what the sample size should be? Low occurrence rates take different stats than moderate to high event rates. Also, if my hypothesis test results don't span the full range of the normal variation in the Y (or at least most of it), then I'll know I did something wrong in the test and the results can't be fully trusted without further analysis.

How long does it take to go from min to max? How else will I know how I need to spread my samples out to avoid a spurious association?
Of course, all of these can be answered with a trend chart of the data.
I also want to know the normal operating range of the suspect causal factor, so I know where to set the low and high values for the test.

Once I have the trend (from which I can get the distribution shape) and I know the basic science behind the process, I can get a usable understanding of the distribution. (An example of the science is that coating thickness will be bounded at zero and will be skewed; tool wear tends toward a uniform-looking distribution...)
 

bobdoering

Stop X-bar/R Madness!!
Trusted Information Resource
In brief: Typically run 50 to 100 pcs consecutively, plot them on a trend chart, generate a histogram, and see what it looks like. At that point it will be normal, or not. If it is a uniform distribution, it will look like a rectangle (the trend chart should look like a sawtooth curve). Otherwise, a Pearson analysis would help zero in on another appropriate distribution. Once you have identified the distribution, use its statistics.
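
A minimal sketch of that first look in Python with numpy/matplotlib (the measurements are simulated with a sawtooth pattern standing in for 100 consecutive parts from a tool-wear process):

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(4)

# Simulated 100 consecutive parts: a repeating tool-wear ramp plus a little noise
measurements = np.concatenate([np.linspace(0.0, 1.0, 25) for _ in range(4)])
measurements = measurements + rng.normal(scale=0.05, size=measurements.size)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Trend chart: look for a sawtooth (tool wear / uniform) vs. random scatter
ax1.plot(measurements, marker="o", linestyle="-")
ax1.set_xlabel("part number (run order)")
ax1.set_ylabel("measurement")
ax1.set_title("Trend chart")

# Histogram: roughly bell-shaped, rectangular, skewed...?
ax2.hist(measurements, bins=15)
ax2.set_title("Histogram")

plt.tight_layout()
plt.show()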
 

BradM

Leader
Admin
In brief: Typically run 50 to 100 pcs consecutively, plot them on a trend chart, generate a histogram, and see what it looks like. At that point it will be normal, or not. If it is a uniform distribution, it will look like a rectangle (the trend chart should look like a sawtooth curve). Otherwise, a Pearson analysis would help zero in on another appropriate distribution. Once you have identified the distribution, use its statistics.

Ahhh, now we're getting somewhere. My question is: how many people do this? No distribution is perfectly normal. At best, it will be somewhat normal. So how much data comes even close to being reasonably normal enough to assume normality?
 

bobdoering

Stop X-bar/R Madness!!
Trusted Information Resource
Ahhh, now we're getting somewhere. My question is: how many people do this? No distribution is perfectly normal. At best, it will be somewhat normal. So how much data comes even close to being reasonably normal enough to assume normality?

Well, the best answer to any question is "it depends". :D I think it is a good idea to ponder whether a distribution can be expected to be normal. Normal distributions are a result of normal processes - and there is an emphasis on natural variation. One of Shewhart's examples was tensile strength: random natural variation caused by a myriad of influences - chemistry, crystalline structure, surface flaws, etc. Sure, I'd buy that. The example I like to use is a processing line of loaves of bread. The height of the loaves of bread is controlled by so many variables - proofing, yeast quality, humidity, accuracy of ingredient ratios, etc. The net result is natural variation - most at a particular height, some less, some more. If a process can be expected to stay at a particular "level", with some variation above and below that level - with NO operator intervention - until a special cause appears, it's normal. That is the "voice of the process". But if you have to have someone adjust it to keep it there, then it is not normal - and you might as well start investigating what the distribution truly is. For precision machining I have found very little evidence of normal distributions - and plenty of evidence to the contrary! It is typically ruled by the uniform distribution.
 