I have a few queries related to interpretation of certain terms in Minitab related to Regression (GLM) and ANOVA. There are a few statistical concepts which I encountered in my research and I am taking the liberty of asking about them as well.

1) What is the difference between ANOVA in Regression and ANOVA in general as displayed in Minitab? In general how should one interpret ANOVA in regression?

2) When we look at ANOVA output in Minitab we also see an R-Sq value. Where does this come from?

3) I understand what residuals are, but why should one do a normality test on residuals? Also, I have read that after fitting a regression the residuals should be plotted against the predicted values. Why?

1) The ANOVA in Regression tests the significance of the constant term and of the coefficient of each term in the regression model. For example, suppose you have a simple linear regression Y = B0 + B1*X. If the output shows B0 has a p-value of 0.35 and B1 has a p-value of 0.023, you should remove the constant term from your model, leaving Y = B1*X. Multiple regression works the same way: just remove terms with non-significant p-values.
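To make the p-value idea concrete, here is a minimal sketch that computes the two coefficient p-values for a simple linear regression by hand (the same numbers Minitab reports in its coefficients table). The data are hypothetical, generated with a true intercept of zero so B0 should come out non-significant:

```python
# Sketch: coefficient p-values for Y = B0 + B1*X on hypothetical data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = np.linspace(1, 10, 20)
y = 2.5 * x + rng.normal(0, 1.5, size=x.size)    # true intercept is 0

X = np.column_stack([np.ones_like(x), x])        # design matrix [1, x]
beta, *_ = np.linalg.lstsq(X, y, rcond=None)     # estimates [B0, B1]
resid = y - X @ beta
dof = x.size - 2                                 # n - number of coefficients
mse = resid @ resid / dof                        # estimate of error variance
se = np.sqrt(mse * np.diag(np.linalg.inv(X.T @ X)))
t = beta / se                                    # t-statistics
p = 2 * stats.t.sf(np.abs(t), dof)               # two-sided p-values
print(dict(zip(["B0", "B1"], np.round(p, 4))))
```

With a strong true slope, the B1 p-value is tiny; the B0 p-value is whatever the noise produces, which is exactly the situation where the advice above says to drop the constant.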

2) R-Sq shows how well the model explains the variation in the response. Even if all terms in your model have significant p-values, the model may still not be good at making predictions; R-Sq helps answer that question. It ranges from 0 to 1. In simple terms, an R-Sq of 0.60 means the model explains 60% of the variation.

There are three types of R-Sq. Plain R-Sq has one weakness: it never decreases as you keep adding terms to the model, whether or not they help. This is where R-Sq (adj) comes in. This measure penalizes you for adding terms: adding a significant term will increase it, adding a non-significant term will decrease it.

Finally, there is R-Sq (pred). This measure evaluates how well the model will predict new observations. If you have overfitted your model, you may have an R-Sq (adj) of 0.94 but an R-Sq (pred) of 0.40. This means the model explains the existing data very well but is nearly worthless for predicting new results.
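The three flavours can be sketched in a few lines. For a linear model, R-Sq (pred) comes from the PRESS statistic, which for least squares can be computed from the ordinary residuals and the hat-matrix leverages without actually refitting leave-one-out models. The data here are hypothetical:

```python
# Sketch: R-Sq, R-Sq (adj), and R-Sq (pred) for a linear model (hypothetical data).
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 5, 15)
y = 1.0 + 2.0 * x + rng.normal(0, 1.0, size=x.size)

X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ beta                                  # ordinary residuals
sst = np.sum((y - y.mean()) ** 2)                 # total variation
sse = e @ e                                       # unexplained variation

r2 = 1 - sse / sst                                # R-Sq
n, k = X.shape
r2_adj = 1 - (sse / (n - k)) / (sst / (n - 1))    # R-Sq (adj): penalized for k terms

H = X @ np.linalg.inv(X.T @ X) @ X.T              # hat matrix
press = np.sum((e / (1 - np.diag(H))) ** 2)       # leave-one-out (PRESS) residuals
r2_pred = 1 - press / sst                         # R-Sq (pred)
print(round(r2, 3), round(r2_adj, 3), round(r2_pred, 3))
```

Because the PRESS residuals are always larger than the ordinary residuals, R-Sq (pred) is always below R-Sq, and the gap widens as the model overfits.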

3) One of the assumptions for regression is that the residuals (unexplained variation) have i) independence, ii) constant variance, and iii) normality. I'll explain this in reverse.

iii) The residuals should be normally distributed with a mean of zero. If they are not, there are a number of possible causes, but the most likely is that the relationship is not linear. There may also be outliers in the data.

ii) There should be constant variance across the range of measurement. Sometimes the variance increases as the measurement gets bigger. This often occurs when the theoretical relationship (regression line) must pass through the zero point. If your data are far from zero this won't make much difference, but if they are close to zero it will.

i) The residuals should be independent. Test this by plotting the residuals in time sequence. This protects you from lurking variables and autocorrelated data.
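The three checks above are usually done with Minitab's residual plots, but each also has a numeric counterpart. The sketch below, on hypothetical data, runs a Shapiro-Wilk test for normality, compares residual spread across the fitted range as a crude constant-variance check, and computes the Durbin-Watson statistic (values near 2 suggest no autocorrelation):

```python
# Sketch: numeric versions of the three residual checks (hypothetical data).
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = np.linspace(0, 10, 30)
y = 3.0 + 1.2 * x + rng.normal(0, 0.8, size=x.size)

X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta
e = y - fitted                                    # residuals, in run order

# iii) normality: Shapiro-Wilk (a large p-value means no evidence against normality)
w, p_norm = stats.shapiro(e)

# ii) constant variance: residual spread in the lower vs upper half of the fitted values
lo, hi = e[fitted < np.median(fitted)], e[fitted >= np.median(fitted)]
spread_ratio = np.std(hi) / np.std(lo)            # far from 1 suggests non-constant variance

# i) independence: Durbin-Watson statistic on the time-ordered residuals
dw = np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

print(round(p_norm, 3), round(spread_ratio, 2), round(dw, 2))
```

These numbers complement the plots rather than replace them; a residuals-versus-fits plot will still reveal curvature or funnel shapes that a single summary statistic can miss.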