# Understanding of Regression and ANOVA in Minitab


#### Mayank Trivedi

Hello Everyone,

I have a few queries related to interpretation of certain terms in Minitab related to Regression (GLM) and ANOVA. There are a few statistical concepts which I encountered in my research and I am taking the liberty of asking about them as well.

1) What is the difference between ANOVA in Regression and ANOVA in general as displayed in Minitab? In general how should one interpret ANOVA in regression?

2) When we look at ANOVA output in Minitab we also see an R-Sq value. Where does this come from?

3) I understand what residuals are, but why should one do a normality test on residuals? Also, I have read that after a regression the residuals should be plotted against predicted values. Why?

Best Regards,

Mayank

#### Stijloor

Staff member
Super Moderator
A Quick Bump!

Can someone help?

Thank you very much!!


#### Darius

I don't use Minitab but...
> 2) When we look at ANOVA output in Minitab we also see an R-Sq value. Where does this come from?

> 3) I understand what residuals are, but why should one do a normality test on residuals? Also, I have read that after a regression the residuals should be plotted against predicted values. Why?
R-Sq is the square of the correlation between observed and fitted values (Pearson's r in simple regression): it measures how much of the variance of the dependent variable is explained by the independent (x) variables.


The residuals MUST show homoscedasticity according to the theory.


One can assume that, when heteroscedasticity is found, the model is not right (you need to apply transformations to get rid of it). A way to find such behaviour is charting the residuals; I almost always chart them against run sequence and against the calculated (fitted) value.


#### Mayank Trivedi

Thank you for your response Darius.

I know about R-Sq in regression and I know ANOVA. My query though still remains unanswered.

When you look at the output of Regression in Minitab, there is also an ANOVA table, and it lists a sum of squares for Regression and for Error. My query is: what is this ANOVA? As I understand ANOVA, it is used to compare the means of three or more data sets. So how do the two differ?

You mentioned that if there is heteroscedasticity you need to transform the data. Why is this required? Is there ever a case where the R-Sq value is high but the residuals display non-normality? If yes, why transform?

Best Regards,

Mayank

#### Steve Prevette

##### Deming Disciple
Staff member
Super Moderator
ANOVA is literally an analysis of variance. Yes, at first thought you think about comparing two or more data sets and trying to conclude whether they are the same or different, and ANOVA does this through comparing their variances. There is a residual term which expresses the variance within each data set as compared to the variance between the data sets.

When I do a regression (be it linear or non-linear, single or multiple) I am trying to "explain" some component of variation in the Y data based upon X1, X2, . . . Xn. The hope is that I have explained the variation in the Y data through the regression, and only have left a set of residual data. ANOVA helps me to show how much of the variation is within the fit versus left over in the residuals.

The theory of regression (using the least-squares fit) assumes that you are able to explain all of the variation in the Y data, with the exception of a set of residual data, which are independent from each other, and randomly distributed as a Normal random variable with mean zero, and some variance (which is calculated by the ANOVA).
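The variance partition Steve describes can be sketched numerically. A minimal example in Python, with made-up data (any roughly linear x-y pairs would do):

```python
# Minimal sketch of the ANOVA decomposition behind a least-squares fit.
# The data are hypothetical, chosen only to illustrate the identity.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8]

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

# Least-squares slope and intercept.
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) \
     / sum((xi - xbar) ** 2 for xi in x)
b0 = ybar - b1 * xbar

fitted = [b0 + b1 * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]

ss_total = sum((yi - ybar) ** 2 for yi in y)            # all variation in Y
ss_regression = sum((fi - ybar) ** 2 for fi in fitted)  # explained by the fit
ss_error = sum(r ** 2 for r in residuals)               # left in the residuals

# The identity the regression ANOVA table is built on:
assert abs(ss_total - (ss_regression + ss_error)) < 1e-9
```

The F-test in the ANOVA table then compares ss_regression (per model degree of freedom) against ss_error (per error degree of freedom).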

#### Statistical Steven

##### Statistician
Staff member
Super Moderator
> I have a few queries related to interpretation of certain terms in Minitab related to Regression (GLM) and ANOVA. There are a few statistical concepts which I encountered in my research and I am taking the liberty of asking about them as well.

> 1) What is the difference between ANOVA in Regression and ANOVA in general as displayed in Minitab? In general how should one interpret ANOVA in regression?

ANOVA in regression tells you whether the model is significant, that is, whether the model explains more than just the random error of the data.

> 2) When we look at ANOVA output in Minitab we also see an R-Sq value. Where does this come from?

R-Sq is the coefficient of determination. It tells you how much of the variation is explained by the model. It is not a very good statistic on its own, as it is highly influenced by outliers and influential points.

> 3) I understand what residuals are, but why should one do a normality test on residuals? Also, I have read that after a regression the residuals should be plotted against predicted values. Why?

As has been stated, least-squares regression has an assumption that the residuals are normally distributed. Plotting the residuals against the predicted values lets you see whether there are patterns that would indicate a transformation. See Daniel and Wood for a good explanation.

Why would you use software and do analysis if you do not understand what the output is telling you?


#### Darius

> You mentioned that if there is heteroscedasticity you need to transform the data. Why is this required? Is there ever a case where the R-Sq value is high but the residuals display non-normality? If yes, why transform?
Setting aside the technical aspects of regression (which are of course important), the practical reason is to get a better estimator.

An example: suppose the true model involves another form of X (e.g. 1/X). If the variation range of X is small enough (say 0.25 to 1), a linear regression on X could fit with a "nice" number (r² > 0.8), but if you chart the residuals you can see a pattern in them, a pattern that will not exist if you transform X before the regression analysis is performed (giving r² = 1).

Sometimes there is theory about the process that documents the relationship between the variables; that can give you a starting point.
Other times, even though you KNOW the relationship exists, you cannot see it because the variation of the dependent variable was too small and the random variation of other variables came into play. Remember: as far as you can, reduce the environmental noise from other variables, and collect the dependent variable over at least the full length of the forecasting (prediction) range; a little more can help.
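The 1/X example above can be reproduced in a few lines; the numbers below are made up purely for illustration:

```python
def fit(x, y):
    """Ordinary least squares for y = b0 + b1*x; returns (b0, b1, r_squared)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    syy = sum((yi - ybar) ** 2 for yi in y)
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1, sxy ** 2 / (sxx * syy)

x = [0.25, 0.4, 0.55, 0.7, 0.85, 1.0]   # narrow range, as in the example
y = [1.0 / xi for xi in x]               # the true model is y = 1/x

# A straight line in x fits with a "nice" r-squared...
b0, b1, r2_linear = fit(x, y)
print(round(r2_linear, 3))   # about 0.85 here, comfortably above 0.8

# ...but the residuals show a curve: positive at the ends, negative in the middle.
resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
print([round(r, 2) for r in resid])

# Transforming the predictor to 1/x removes the pattern entirely.
_, _, r2_transformed = fit([1.0 / xi for xi in x], y)
print(round(r2_transformed, 6))  # 1.0
```

The high r² of the straight-line fit is exactly the situation Mayank asked about: a "good" R-Sq coexisting with clearly patterned residuals.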

#### Miner

##### Forum Moderator
Staff member
> I have a few queries related to interpretation of certain terms in Minitab related to Regression (GLM) and ANOVA. There are a few statistical concepts which I encountered in my research and I am taking the liberty of asking about them as well.

> 1) What is the difference between ANOVA in Regression and ANOVA in general as displayed in Minitab? In general how should one interpret ANOVA in regression?

> 2) When we look at ANOVA output in Minitab we also see an R-Sq value. Where does this come from?

> 3) I understand what residuals are, but why should one do a normality test on residuals? Also, I have read that after a regression the residuals should be plotted against predicted values. Why?
1) The ANOVA in Regression tests the significance of the constant term and of the coefficient of each term in the regression model. For example, suppose you have a simple linear regression Y = B0 + B1X. If the ANOVA table shows that B0 has a p-value of 0.35 and B1 has a p-value of 0.023, you should remove the constant term from your model, leaving Y = B1X. Multiple regression works the same way: just remove terms with non-significant p-values.

2) R-Sq shows how good the model is at explaining the variation. Even if all terms in your model have significant p-values, the model may not be good at making predictions; R-Sq helps answer that question. It is a positive number that varies from 0 to 1. In simple terms, an R-Sq of 0.60 means the model explains 60% of the variation.

There are three types of R-Sq. Plain R-Sq has one weakness: if you keep adding terms to the model, it always gets bigger. This is where R-Sq (adj) comes in. This measure penalizes you for adding terms: if you add a significant term it will increase; if you add a non-significant term, it will decrease.

Finally, there is R-Sq (pred). This measure evaluates how well the model will predict new observations. If you have overfitted your model, you may have an R-Sq (adj) of 0.94 but an R-Sq (pred) of 0.40. This means the model explains the existing variation very well but is worthless for predicting new results.
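As a sketch of where these three values come from, here is a hand computation for a simple linear fit, using the standard formulas on hypothetical data. For R-Sq (pred) the usual route is the PRESS statistic, which inflates each residual by its leverage to mimic leave-one-out prediction errors:

```python
# Hypothetical, nearly linear data.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.2, 2.1, 2.9, 4.2, 4.8, 6.1]

n, p = len(x), 2                      # p = number of fitted parameters (b0, b1)
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
b0 = ybar - b1 * xbar

resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
ss_error = sum(r ** 2 for r in resid)
ss_total = sum((yi - ybar) ** 2 for yi in y)

r_sq = 1 - ss_error / ss_total
r_sq_adj = 1 - (ss_error / (n - p)) / (ss_total / (n - 1))

# PRESS: each residual divided by (1 - leverage), then squared and summed.
leverage = [1 / n + (xi - xbar) ** 2 / sxx for xi in x]
press = sum((r / (1 - h)) ** 2 for r, h in zip(resid, leverage))
r_sq_pred = 1 - press / ss_total

print(round(r_sq, 4), round(r_sq_adj, 4), round(r_sq_pred, 4))
```

On well-behaved data like this the three values sit close together; the overfitting case Miner describes shows up as R-Sq (pred) dropping far below the other two.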

3) One of the assumptions for regression is that the residuals (unexplained variation) have i) independence, ii) constant variance, and iii) normality. I'll explain this in reverse.

iii) The residuals should be normally distributed with a mean of zero. If they are not, there are a number of causes, but the most likely is that there is not a linear relationship. There may also be outliers in the data.

ii) There should be constant variance across the range of measurement. Sometimes the variance will increase as the measurement gets bigger. This usually occurs when the theoretical relationship (regression line) MUST pass through the zero point. If you are far from zero, this won't make much difference, but if you are close to zero it will.

i) The residuals should be independent. Test this by plotting the residuals in time sequence. This protects you from lurking variables and autocorrelated data.
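For point i), a quick numeric companion to the time-sequence plot is the lag-1 autocorrelation of the residuals (the same idea the Durbin-Watson statistic formalizes). The two residual series below are made up for illustration:

```python
def lag1_autocorr(resid):
    """Lag-1 autocorrelation of a residual series kept in time order."""
    n = len(resid)
    mean = sum(resid) / n
    centered = [r - mean for r in resid]
    num = sum(centered[i] * centered[i + 1] for i in range(n - 1))
    den = sum(c ** 2 for c in centered)
    return num / den

# Hypothetical residuals with no time pattern...
noise = [0.3, 0.2, -0.1, -0.3, 0.2, 0.4, -0.2, -0.4, 0.1, -0.2]
# ...and residuals that drift steadily upward over time.
trending = [-0.5, -0.4, -0.3, -0.1, 0.0, 0.1, 0.2, 0.3, 0.4, 0.5]

print(round(lag1_autocorr(noise), 2))     # 0.04: close to 0, looks independent
print(round(lag1_autocorr(trending), 2))  # 0.71: strong positive autocorrelation
```

A clearly positive value is the numeric signature of the trend a time-sequence residual plot would show, and a hint of a lurking variable.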


#### Allattar

It is worth pointing out that there is no fundamental difference in the way ANOVA is calculated in Regression and in ANOVA itself.

In both cases we are finding sums of squared distances.

How it calculates the sum of squares can be different depending on what you have asked it to do.

In a simple regression you are fitting a linear model to the results. The sum of squares in the error are found from the distances of the observed values to the least squares fitted line.

In an ANOVA the predictor is usually identified as a categorical variable. The sum of squares of the error is found from the squared distances of the observed values to the mean of each level.

If you have a predictor of just two levels, -1 and +1, then the two methods coincide. In regression a fitted line is generated based on the smallest sum of squares of the error to the line; with only two levels that line passes through the mean of each level. In ANOVA the model uses the mean of the two levels directly and finds the sum of squares of the error as the squared distance from those means.
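A small sketch of that agreement, with made-up data on a balanced -1/+1 predictor:

```python
# Two-level predictor coded -1/+1, three hypothetical observations per level.
x = [-1, -1, -1, 1, 1, 1]
y = [4.1, 3.8, 4.4, 6.2, 5.9, 6.5]

# Regression route: fit y = b0 + b1*x by least squares.
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) \
     / sum((xi - xbar) ** 2 for xi in x)
b0 = ybar - b1 * xbar
ss_error_reg = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

# ANOVA route: squared distances from each observation to its level mean.
mean_lo = sum(yi for xi, yi in zip(x, y) if xi == -1) / 3
mean_hi = sum(yi for xi, yi in zip(x, y) if xi == 1) / 3
ss_error_anova = sum((yi - (mean_lo if xi == -1 else mean_hi)) ** 2
                     for xi, yi in zip(x, y))

# The fitted line passes through both level means, so the two
# error sums of squares agree exactly.
assert abs(ss_error_reg - ss_error_anova) < 1e-9
```

The fitted values at x = -1 and x = +1 are exactly the two level means, which is why both routes charge the same leftover variation to error.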

I'll stop there before I end up going into degrees of freedom, mean square errors, and random or nested factors.