# Normality Assumption - Most Tests 'Assume' a Normal Distribution - t test statistic


#### KenK - 2009

Does anyone know why most tests of means involving the t-statistic assume that the individual data values are normally distributed?

The t-statistic in the most common tests of means involves only the mean and its standard error (s/sqrt(n)), and not the individual values themselves.

The Central Limit Theorem says that sample means will tend to follow a normal distribution as the sample size gets larger, regardless of the distribution of the individual data values. So, assuming sample sizes are largish, the distribution of the individual data values shouldn't really matter.

I started wondering whether the sample standard deviation calculated from a highly skewed distribution results in the proper standard error for the respective distribution of the means.

I ran some simulations using Crystal Ball and was pleased to find that the sample standard deviation divided by sqrt(n) does indeed provide a very good estimate of the standard deviation of the distribution of the respective means. Pretty cool.
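That first simulation can be sketched in Python instead of Crystal Ball (the exponential parent distribution and the choice of n = 25 are my own illustrative assumptions, not from the post):

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed for reproducibility
n, runs = 25, 10_000

# Draw many samples from a highly skewed distribution (exponential, sigma = 1)
samples = rng.exponential(scale=1.0, size=(runs, n))

means = samples.mean(axis=1)                              # one xbar per run
se_estimates = samples.std(axis=1, ddof=1) / np.sqrt(n)   # s/sqrt(n) per run

# Compare the average estimated standard error with the actual
# spread of the sample means across runs; both should be near
# sigma/sqrt(n) = 1/sqrt(25) = 0.2 despite the heavy skew.
print(f"mean of s/sqrt(n):  {se_estimates.mean():.4f}")
print(f"SD of sample means: {means.std(ddof=1):.4f}")
```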

I also ran some simulations to get a feel for how big the sample sizes need to be for normality of the means. Here are the results:

If the individual data distribution is symmetric (I used the Uniform distribution to represent a symmetric nonnormal distr.) then even a fairly small sample size (n=5) resulted in means which are quite normally distributed (10,000 runs not significantly different from normal using MINITAB's Anderson-Darling test).

If the individual data distribution is not symmetric (I used an Exponential distribution which is very skewed right) then even a fairly large sample size (n=50) resulted in means which are only marginally normally distributed (10,000 runs give fairly straight prob. plot, but MINITAB's Anderson-Darling test P-Value = 0.000)
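Both sample-size experiments above can be re-created in a few lines, with scipy's Anderson-Darling statistic standing in for MINITAB's test (distributions and sample sizes as described in the post):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
runs = 10_000

# Symmetric nonnormal case: means of n=5 Uniform(0,1) draws
uniform_means = rng.uniform(size=(runs, 5)).mean(axis=1)

# Skewed case: means of n=50 Exponential draws
expo_means = rng.exponential(size=(runs, 50)).mean(axis=1)

# Anderson-Darling normality test (stand-in for MINITAB's version)
u_res = stats.anderson(uniform_means, dist="norm")
e_res = stats.anderson(expo_means, dist="norm")

print(f"uniform n=5:      A-D statistic = {u_res.statistic:.3f}")
print(f"exponential n=50: A-D statistic = {e_res.statistic:.3f}")
# With 10,000 means, the skewed parent leaves a much larger
# departure from normality than the symmetric one.
```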


#### Graeme

Re: Normality Assumption

Originally posted by KenK
Does anyone know why most tests of means involving the t-statistic assume that the individual data values are normally distributed? ...

Ken,

At the risk of being overly simplistic, I think the short answer to your initial question is that the t distribution is a property of the Normal distribution. This property was discovered early in the 20th century and published in 1908 by W. S. Gosset, under the pseudonym "Student".

For very large samples (generally n > 100) the t distribution is essentially the same as the standardized Normal (Z) curve with a mean of 0. The t statistic was developed from work with small samples, however, and that is where it is usually used. (In this context, "small" is generally taken to mean a sample size of less than 30.) Typical uses include estimating the uncertainty in the mean, and testing to see if the means of two populations are equal. The main assumptions are that the populations are normally distributed (as you said in your question) and that you are using small samples and n - 1 degrees of freedom.

If the population is known to be non-normal (skewed), then the use of the t statistic is not appropriate. In those cases distribution-free tests should be used, such as Wilcoxon-Mann-Whitney, Kolmogorov-Smirnov and others.
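The distribution-free tests named above are both available in scipy; here is a hedged sketch (the two exponential samples and their sizes are my own illustrative choices):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Two skewed samples that genuinely differ (different scale parameters)
a = rng.exponential(scale=1.0, size=200)
b = rng.exponential(scale=2.0, size=200)

# Wilcoxon-Mann-Whitney rank-sum test: no normality assumption
u_stat, u_p = stats.mannwhitneyu(a, b, alternative="two-sided")

# Two-sample Kolmogorov-Smirnov test: compares the full distributions
ks_stat, ks_p = stats.ks_2samp(a, b)

print(f"Mann-Whitney p = {u_p:.2e}, KS p = {ks_p:.2e}")
```

Both tests rely on ranks or empirical distribution functions rather than on a normal population, which is why they remain valid for skewed data.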

Many statistics texts have good discussions of this statistic, and some even have the mathematical proofs if you are so inclined. If you want, I can give you a list of the ones I have on my bookshelf.


#### KenK - 2009

No, it doesn't really address the question I posed. The t-test statistics focus on the distribution of xbar and the resulting t statistic, not x.

In one of the simplest t-tests - a one-sample test of a mean - the statistic under consideration is the sample mean, xbar.

If we knew the population standard deviation, we could use the z statistic:

z = (xbar - Mu) / (Sigma/sqrt(n))

If I have a fairly symmetric distribution of x, with a moderately large sample size n, then I can show that the distribution of xbar will be VERY VERY close to normal with mean Mu (based upon my Null Hypothesis) and standard deviation estimated by s/sqrt(n), where s is the sample standard deviation of the x values.

In turn, I can then say that z will have a Normal(0,1) distribution.

Since I typically won't know Sigma, I use the sample standard deviation, s, instead. Now, since I know xbar is VERY VERY close to being distributed Normal(Mu, Sigma/sqrt(n)), the t statistic

t = (xbar - Mu) / (s/sqrt(n))

will then follow a central t distribution with (n - 1) degrees of freedom.

I can calculate the t statistic as above using sample data, and then compare the statistic I get with the t(n - 1) distribution. If my test statistic t is very extreme, then I would reject the null hypothesis.

I have just completed the one-sample t-test comparing a sample mean to a hypothetical value without ever needing to make an assumption about the distribution of x being Normal(Mu, Sigma).
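The one-sample test just walked through can be written out directly and cross-checked against scipy's built-in version (the sample values here are simulated, purely for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
x = rng.normal(loc=10.2, scale=1.0, size=20)  # illustrative sample data
mu0 = 10.0                                    # hypothesized mean (H0)

n = len(x)
xbar = x.mean()
s = x.std(ddof=1)
t = (xbar - mu0) / (s / np.sqrt(n))           # t = (xbar - Mu) / (s/sqrt(n))
p = 2 * stats.t.sf(abs(t), df=n - 1)          # two-sided p from t(n - 1)

# Cross-check against scipy's one-sample t-test
t_ref, p_ref = stats.ttest_1samp(x, mu0)
print(f"t = {t:.4f} (scipy {t_ref:.4f}), p = {p:.4f} (scipy {p_ref:.4f})")
```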

So, back to my original question: why do many (most?) texts specify that assumption?

In my mind the only distribution assumption I needed was that xbar was normally distributed. Of course, there is also the assumption of independence.


#### Graeme

it was late on a dark and stormy night ...

Ken,

AHA! -- I see I must have misunderstood your original post.

On the other hand, I think what you are describing with the t-test is actually just a specific application of the general principle.

I just skimmed through several of my books again. All of the "statistics" books I have use a population, or an ungrouped random sample, in their examples. In that case, it is reasonable for them to state that the population must be normal, which is correct. They are dealing with individual points, not a collection of averages (subgroups).

However, my "QA/QC" books - which discuss the t statistic in the context of process description (control) charts - do not necessarily make the same statement. But they do not always state the assumptions, either.

The usual assumption is that the control chart is for averages, and each point on the chart is therefore the average of a small sample. Because of that, the Central Limit Theorem is at work. The distribution of the sample averages will tend to be approximately normal regardless of the distribution they are drawn from. That is what allows the t-test to work as you so clearly described.

So, to use the t-test the data must be normally distributed. But if the data consists of averages of small samples, then it will be approximately normal by the Central Limit Theorem, regardless of the distribution of the underlying population.

The main restriction I have noticed is that Besterfield says the sample size should be at least four. (Quality Control, 4th edition, 1994)

Graeme


#### KenK - 2009

Well at least one other person in this big world understands me (not including my wife, of course).

Thanks for chiming in.

Ken K.


#### Dave Strouse

Assumptions of normality in the underlying distribution

The assumptions of normality in the underlying distribution are based, I believe, on the theoretical derivation of the t-distribution. The proof, which I don't have at hand, proceeds from sampling from a N(mu, sigma) distribution and shows that this leads to the t-distribution, which involves a somewhat involved ratio of Gamma functions.
However, a more practical question is: how important is this assumption in actual usage?

I think George Box and the Hunters address this to some degree in their "Statistics for Experimenters".

But the best discussion I've seen is from a 40 year old book by Rupert Miller called "Beyond ANOVA". I got it out of a library where I previously worked and don't have a copy myself. I checked AMAZON.com and they can get it.

Miller talks about the t-test's robustness to the normality assumption. I think he says that for groups larger than 8 or 9, the underlying distribution really does not matter much. It is much more important to be sure that the sample is independent and random. He gives some graphical means to check normality and also offers nonparametric alternatives if you can't get the minimum sample size needed for robustness to non-normality. A handy little book; maybe I'll buy it someday.
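That robustness claim is easy to probe by simulation. This sketch (my own setup, not from Miller's book) checks the actual type I error rate of the t-test when the data come from a skewed population:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n, runs, alpha = 10, 20_000, 0.05

# Data from a skewed parent whose true mean is 1 (exponential, scale 1),
# so H0: mu = 1 is TRUE and every rejection is a type I error.
x = rng.exponential(scale=1.0, size=(runs, n))
res = stats.ttest_1samp(x, popmean=1.0, axis=1)
rejection_rate = (res.pvalue < alpha).mean()

# If the t-test were exact here, this would print roughly 0.05;
# skewness at small n typically distorts it somewhat.
print(f"observed type I error rate: {rejection_rate:.3f} (nominal {alpha})")
```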

Hope this helps.