# Finding optimum transformation equation for two sets of data


#### ilacatd

I want to do a 2-sample t-test to compare the burst pressure performance of two different products that are used in the same application. The data is not very normal: the Anderson-Darling p-value is 0.016 for one group and 0.096 for the other group. There are 30 data points in each group. The data is skewed to the right in both groups (most of the data is on the left side of the histogram, with a pretty long right tail). I would like to transform the data using Box-Cox transformation and Johnson Transformation in Minitab and then see which one has the highest resulting normality p-value and the highest correlation coefficient when plotted. I believe this will result in the most accurate t-test results. Is that right? To do the transformation correctly, it's necessary to use the same transformation equation for both groups in order for the t-test to be valid. If I use different equations I will end up with either diverging or converging data sets and means, and the result will be meaningless. Is there a way in Minitab (or any other tool) to find the optimum equation for two different data sets? You can't just stack the data in one column, because they are not from the same population; they come from two different products.

I ran a Box-Cox on Group 1 and got an optimum lambda of 0.171, and on Group 2 I got an optimum lambda of 0.231. I need to use only one lambda. So do I use the average of 0.171 and 0.231 = 0.201? The resulting A-D p-values are 0.470 and 0.262, which is a lot better than the raw data, but I'm not sure whether it's optimal.
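As an alternative to averaging the two lambdas, one option is to pick the single lambda that maximizes the sum of the two groups' Box-Cox profile log-likelihoods, letting each group keep its own mean and variance. The following is a minimal sketch in plain Python under that assumption. The samples are synthetic stand-ins (not the burst-pressure data from this post), and the function names are illustrative, not Minitab features.

```python
import math
import random

def boxcox(x, lam):
    """Box-Cox transform of one positive value."""
    return math.log(x) if abs(lam) < 1e-12 else (x ** lam - 1.0) / lam

def profile_loglik(data, lam):
    """Box-Cox profile log-likelihood for one group (own mean and variance)."""
    n = len(data)
    y = [boxcox(x, lam) for x in data]
    mean = sum(y) / n
    var = sum((v - mean) ** 2 for v in y) / n  # maximum-likelihood variance
    return -0.5 * n * math.log(var) + (lam - 1.0) * sum(math.log(x) for x in data)

def common_lambda(groups, lo=-2.0, hi=2.0, step=0.01):
    """Grid-search the lambda that maximizes the summed log-likelihood."""
    steps = int(round((hi - lo) / step))
    grid = [lo + i * step for i in range(steps + 1)]
    return max(grid, key=lambda lam: sum(profile_loglik(g, lam) for g in groups))

# Hypothetical right-skewed samples, NOT the data discussed in this thread.
rng = random.Random(0)
group1 = [rng.lognormvariate(5.5, 0.4) for _ in range(30)]
group2 = [rng.lognormvariate(5.7, 0.4) for _ in range(30)]

lam1 = common_lambda([group1])         # best lambda for group 1 alone
lam2 = common_lambda([group2])         # best lambda for group 2 alone
lam = common_lambda([group1, group2])  # one lambda shared by both groups
print(f"group 1: {lam1:.2f}, group 2: {lam2:.2f}, common: {lam:.2f}")
```

Both groups would then be transformed with the same `lam` before running the t-test. Note that the sketch does not stack the data into one population: only the transformation parameter is shared.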

With Johnson Transformation, it's much more complicated to find one optimum equation because there are so many factors in the equation. For Group 1, I got 2.79632 + 1.50848 * Log( ( X - 17.2401 ) / ( 2265.63 - X ) ), and for Group 2 I got -0.0969158 + 1.32128 * Asinh( ( X - 190.918 ) / 121.167 ). The A-D p-values are 0.534 and 0.431 respectively, which is better than the Box-Cox transformation. But I can't figure out how to find the optimum Johnson Transformation for both groups; the equations are so different.
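To see concretely why two separately fitted Johnson equations cannot be compared, the sketch below evaluates both quoted equations at the same raw pressures. It assumes that `Log` in the Minitab output is the natural logarithm; the test pressures are hypothetical values chosen only for illustration.

```python
import math

def johnson_group1(x):
    """Equation quoted for Group 1 (bounded form, valid for 17.2401 < x < 2265.63)."""
    # Assumption: Minitab's "Log" here means the natural logarithm.
    return 2.79632 + 1.50848 * math.log((x - 17.2401) / (2265.63 - x))

def johnson_group2(x):
    """Equation quoted for Group 2 (unbounded form using asinh)."""
    return -0.0969158 + 1.32128 * math.asinh((x - 190.918) / 121.167)

# Hypothetical burst pressures, used only to compare the two mappings.
for x in (150.0, 200.0, 300.0, 500.0):
    print(f"x={x:6.1f}  eq1 z={johnson_group1(x):+.3f}  eq2 z={johnson_group2(x):+.3f}")
```

Each equation maps its own group toward a standard normal, so the same raw pressure lands on a different z-value depending on which equation is applied. A t-test on the two separately transformed columns would therefore compare scales rather than products.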

Does anyone have any suggestions?

I'm also aware that transformations can be tricky, and that it's best to use a single type of transformation which fits the type of data in question and then stick with it, rather than finding the "optimum" equation for every data set. For example, break strength of round cables varies with the square of the diameter of the cable, so SQRT(X) is probably the best transformation equation. Bacteria multiplying in a Petri dish would have exponential growth, so would that be log(X)? Survival time to failure might be Weibull, etc. My test is burst pressure of a vessel, so I'm not sure which transformation is best. I generally get good results with natural log, but it's not that clear. Does there need to be a further justification for doing a transformation besides simply to improve normality for purposes of doing a t-test? Any comments on transformations in general and in choosing the most appropriate transformation for an ongoing series of tests would be appreciated.

Thanks.

Ari Goldberg

#### Miner

##### Forum Moderator
The Johnson transform is the method of last resort. If Box-Cox works, use it instead. At n=30, a t-test is somewhat tolerant to non-normality, so using the average lambda should work fine.

You can also try using a nonparametric approach such as the Mann-Whitney test.
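For reference, the Mann-Whitney test works on ranks, so it needs no transformation at all. Below is a minimal plain-Python sketch using the large-sample normal approximation, with no tie or continuity correction; the function name and sample values are hypothetical. In practice a library routine such as `scipy.stats.mannwhitneyu` would normally be used instead.

```python
import math

def mann_whitney_u(a, b):
    """Two-sided Mann-Whitney U test via the normal approximation."""
    pooled = sorted((v, g) for g, sample in enumerate((a, b)) for v in sample)
    values = [v for v, _ in pooled]
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j + 1 < len(pooled) and values[j + 1] == values[i]:
            j += 1
        for k in range(i, j + 1):
            ranks[k] = (i + j) / 2.0 + 1.0  # tied values share the average rank
        i = j + 1
    n1, n2 = len(a), len(b)
    r1 = sum(r for r, (_, g) in zip(ranks, pooled) if g == 0)
    u1 = r1 - n1 * (n1 + 1) / 2.0  # U statistic for the first sample
    mu = n1 * n2 / 2.0
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u1 - mu) / sigma
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided p-value
    return u1, p

# Hypothetical burst pressures for two products (not the poster's data).
product_1 = [212, 230, 245, 251, 268, 290, 310, 355, 420, 510]
product_2 = [260, 275, 288, 301, 322, 340, 365, 410, 480, 590]
u, p = mann_whitney_u(product_1, product_2)
print(f"U = {u:.1f}, approximate two-sided p = {p:.3f}")
```

Because only the ordering of the values matters, any monotonic transformation (log, square root, Box-Cox) would give exactly the same U and p-value.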


#### TWA - not the airline

Trusted Information Resource
What is the reason for performing the t-test? What kind of decision will be made when you find out that there is a significant difference in the mean burst pressure, or what happens if you do not find this difference? This should influence your decisions, e.g. using a nonparametric test would save you from suspicions that you may have fudged the results by tweaking the transformation parameters. And depending on your null hypothesis, your transformations will have an impact on the alpha- and beta-error...

#### Bev D

##### Heretical Statistician
Super Moderator
I am one of those heretics who opposes transformations. Usually you can make valid conclusions without transforming your data. I have played with transforming vs. not transforming and I've never found a situation where transforming was necessary. Too often people reflexively reach for a transformation because their data doesn't appear to be Normally distributed. As Miner noted, most statistical tests are fairly robust to deviations from Normality. In addition, for data that are simply NOT Normally distributed, other techniques are simpler, easier and more compelling.

My other learning over the years is that the real issue isn't the non-Normality of the data but the inherent weakness of the study design given the nature of the process and the real question to be answered.

If you are going to resort to transformation, Miner's advice is correct (as always).

Can you post your data? Perhaps there is more to learn from the data than a simple t-test.

#### TWA - not the airline

Trusted Information Resource
Bev, you're right. One should understand what causes one's data to be non-normal before just applying a transformation and believing the results that follow. If you do not know what actually happened, you should use the data as is...

#### Miner

##### Forum Moderator
> The data is skewed to the right in both groups (most of the data is on the left side of the histogram, with a pretty long right tail).

I agree with Bev. I prefer to deal with data in its naturally occurring distribution. I see too many people trying to transform non-Normal data without understanding why it isn't Normal. The reason is usually because the process was not stable, or they had mixed different process streams. In both cases, you should not transform the data, but address the instability and differences between streams. Transforming simply hides this information.

I suspect that you are dealing with a Largest Extreme Value distribution. While Minitab cannot model this particular distribution, it can model a Weibull distribution which can very closely match an LEV.

#### reynald

##### Quite Involved in Discussions
> I want to do a 2-sample t-test to compare the burst pressure performance of two different products that are used in the same application. ...
> Ari Goldberg

I can't really put it into words, but somehow my engineering instincts tell me that I should put more weight on the left side of the distribution than on the right side/tail.
This is if I get the picture right: a burst pressure test is more of a safety test, right? So I should not be concerned with outliers on the right side (i.e. bursts at very high pressure). I would be more worried about those instances where bursts occur at low pressures.
What I'm saying is that even if I can prove that the mean burst pressures are comparable for both products, that does not tell me which one is safer. One could have 25% of units bursting at, let's say, 50 psi while the other has only 5%. I would rather go for the second one even if the first exhibited a higher mean pressure.
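
The point can be made with a small sketch that compares the mean against the share of units bursting below a limit. The numbers are hypothetical, constructed only to mirror the 25% vs. 5% example above, and `fraction_below` is an illustrative helper.

```python
from statistics import mean

def fraction_below(sample, limit):
    """Share of units that burst below the given pressure limit."""
    return sum(x < limit for x in sample) / len(sample)

# Hypothetical burst pressures in psi, built to mirror the example above.
product_a = [40, 42, 45, 46, 48] + [120] * 15  # higher mean, weak low tail
product_b = [45] + [90] * 19                   # lower mean, stronger low tail

for name, data in (("A", product_a), ("B", product_b)):
    print(f"product {name}: mean = {mean(data):.1f} psi, "
          f"below 50 psi = {fraction_below(data, 50):.0%}")
```

Here product A has the higher mean yet five times as many low-pressure bursts, which a comparison of means alone would not reveal.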

#### TWA - not the airline

Trusted Information Resource
reynald, I totally agree, the mean does not say much about the safety. That's why I wanted to know what the OP actually wants to accomplish with the t-test comparison of the means...