Interpreting Linear Regression Results from Minitab

stats_beginner

Hi,

I am trying to apply linear regression to a limited data set relating the number of injuries to the number of fatalities in building collapse. There are further variables (e.g. building type, extent of collapse etc) that will influence this relationship, but due to a lack of information I am not in a position to account for these yet.

I have performed a linear regression using Minitab on the full data set and got the following relationship (the minitab results and related plots are in Regression I.pdf attached):

Injuries = 14.6 + 1.20 Fatalities

However, the residual plots don't appear to meet the normality assumptions (I think), so I have tried repeating the regression with observations 4 and 11 omitted (which were identified as having large standardized residuals). This resulted in the following adjusted equation (with Minitab results and related plots in Regression II.pdf attached):

Injuries - adjusted = 5.2 + 1.50 Fatalities - adjusted

Now the residual plots appear to be more 'normal', but two observations are still highlighted as having large standardized residuals. Omitting the two points also increased the R-sq value from 67.1% to 71.6%.

Then as a separate attempt to improve the results I tried deleting observations 2 and 4 (both had a much larger number of fatalities than the rest of the data). This resulted in the following adjusted equation (with minitab results and related plots in Regression III.pdf attached):

Injuries - adjusted = 9.3 + 1.4 Fatalities - adjusted

In this case the residual plots are better than for case I but not as good as for case II, while the R-sq value is reduced to 50.2% (the lowest of the three values).

From this I would consider regression II to be the best but I am wondering if I am missing something/interpreting the results incorrectly?

I know the data isn't great but I want to have an approximate relationship which I can improve as more data becomes available.

I'd appreciate any advice on this as my stats experience is extremely limited!

Thanks in advance,

V
 

Attachments

  • Regression I.pdf (147.6 KB)
  • Regression II.pdf (149.1 KB)
  • Regression III.pdf (148.4 KB)

Jen Kirley

Quality and Auditing Expert
Leader
Admin
Re: Interpreting linear regression results

Welcome to the Cove! :bigwave:

Are you a student? If so, is this an assignment from a textbook? If so, which one and which edition?
 
NumberCruncher

Re: Interpreting linear regression results

Hi Stats_Beginner

You are right, the data are not Normally distributed. You need to transform the data. There isn't "THE" correct transformation. A bit of trial and error suggests that square root of both sets of data works well.

I don't have Minitab, but I can guarantee that there will be a data transformation or data calculation menu that will allow you to create two new columns holding the square roots of the data. You can then fit a new regression to the transformed columns, which should fit far better.
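For anyone following along outside Minitab, here is a minimal sketch of the square-root approach in plain Python. The counts are invented placeholders, not the data from this thread, and `fit_line` and `predict_injuries` are hypothetical helper names chosen for illustration.

```python
import math

def fit_line(x, y):
    """Ordinary least squares for y = a + b*x; returns (a, b, r_squared)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    syy = sum((yi - mean_y) ** 2 for yi in y)
    b = sxy / sxx
    a = mean_y - b * mean_x
    r_squared = sxy ** 2 / (sxx * syy)
    return a, b, r_squared

# Placeholder counts for illustration only (NOT the thread's data).
fatalities = [1, 2, 4, 6, 9, 12, 20, 35]
injuries = [3, 10, 9, 25, 18, 40, 45, 60]

# Transform both columns, then fit the line on the square-root scale.
sqrt_fat = [math.sqrt(v) for v in fatalities]
sqrt_inj = [math.sqrt(v) for v in injuries]
a, b, r2 = fit_line(sqrt_fat, sqrt_inj)

def predict_injuries(fatal_count):
    """Back-transform: the model is sqrt(injuries) = a + b*sqrt(fatalities)."""
    return (a + b * math.sqrt(fatal_count)) ** 2

print(f"sqrt(Injuries) = {a:.2f} + {b:.2f} sqrt(Fatalities), R-sq = {r2:.1%}")
```

Squaring the fitted value puts a prediction back on the original scale, though the result is not exactly the mean number of injuries, so treat it as approximate.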

NC
 
stats_beginner

Re: Interpreting linear regression results

Hi, Thanks for your replies.

Jennifer, I am not a student. I'm an engineer and this is a set of data I've collected. However, I've forgotten all of the stats I did in college.

NumberCruncher, ok so first problem seems to be that the data is not normally distributed. I have taken your advice and taken the square root of the data and reran the regression. This improves the residual plots considerably but the histogram still doesn't have the bell-curve shape I would expect to need (I've attached the results again). Can I say the data is normal with this histogram?

I'm going to try a few different transformations in the meantime but assuming the square root of the data gives the 'most normal' plots how should I proceed? Is it correct to delete different values until the histogram improves?

Thanks

V
 

Attachments

  • Regression IV.pdf (146.6 KB)

Miner

Forum Moderator
Leader
Admin
Re: Interpreting linear regression results

The normal probability plot looks very normal. It is a better test than the histogram. If you go into Options, you can set the Residuals plot to display the Anderson-Darling test results.
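As a rough illustration of what the Anderson-Darling statistic measures, here is a sketch in plain Python. It is not Minitab's implementation; `anderson_darling_normal` is a hypothetical helper, and the residuals listed are placeholders rather than values from the attached output.

```python
import math
from statistics import NormalDist, mean, stdev

def anderson_darling_normal(values):
    """Return (A2, adjusted A2) for a normality check.

    The mean and standard deviation are estimated from the sample itself.
    """
    n = len(values)
    m, s = mean(values), stdev(values)
    z = sorted((v - m) / s for v in values)
    cdf = [NormalDist().cdf(zi) for zi in z]
    # Compare each ordered point with its mirror-image point in the other tail.
    total = sum(
        (2 * i + 1) * (math.log(cdf[i]) + math.log(1 - cdf[n - 1 - i]))
        for i in range(n)
    )
    a2 = -n - total / n
    # Small-sample adjustment commonly applied when mean and sd are estimated.
    adjusted = a2 * (1 + 0.75 / n + 2.25 / n ** 2)
    return a2, adjusted

# Placeholder residuals (17 values, matching only the sample size in the thread).
placeholder_residuals = [-1.8, -1.1, -0.9, -0.6, -0.4, -0.3, -0.1, 0.0, 0.1,
                         0.2, 0.4, 0.5, 0.7, 0.9, 1.2, 1.5, 2.1]
a2, a2_adjusted = anderson_darling_normal(placeholder_residuals)
print(f"A-squared = {a2:.3f}, adjusted = {a2_adjusted:.3f}")
```

A small statistic means the ordered residuals track a normal curve closely. An adjusted value above roughly 0.75 is commonly read as evidence against normality at the 5% level, but with so few points the test has little power, so the probability plot is still worth looking at.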
 
NumberCruncher

Re: Interpreting linear regression results

Hi stats_beginner

Actually, I don't think the results look at all bad. The normal probability plot is almost perfect and the residuals vs fitted value shows no obvious structure.

The histogram doesn't look too bad either. Remember, you only have 17 data points. That's not a lot to make a histogram out of.

Personally, I would go with the double square root transformation for the moment, but don't stick too dogmatically to it. Collect more data and see if the model still fits.

No, the fit isn't perfect, but you have a very messy data set (sorry about the rather sick pun, it wasn't intentional, honest!). The assumption behind a model is that one variable can be used to generate another. In the hard sciences, this is a very good assumption. The experiments can be controlled and the physical model assumes cause and effect.

With happenstance data such as this, there is no physical reason why 4 people being killed should cause 14 people to be injured. It's actually the incident that causes both, but the mathematical model ignores that.

As I am ignorant of this kind of study, I am slightly surprised that you get such a good fit with so few data.

No, you shouldn't keep deleting points to improve the fit.
1) Without a valid statistical reason, this is bad practice.
2) The outliers may suggest something important. If a particular incident produced an unusually low or high number of injuries or fatalities, perhaps there is something in the detail of the incident report that tells you why this happened.
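To make the second point concrete, the sketch below flags unusual observations for follow-up instead of deleting them. It assumes a simple straight-line fit and a cutoff of 2 on the standardized residual, which is a common rule of thumb. The function names and the data are hypothetical.

```python
import math

def standardized_residuals(x, y):
    """Standardized residuals from a least-squares line y = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = mean_y - b * mean_x
    resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    # Residual standard error with n - 2 degrees of freedom.
    s = math.sqrt(sum(e * e for e in resid) / (n - 2))
    # Leverage of each point in simple linear regression.
    leverage = [1 / n + (xi - mean_x) ** 2 / sxx for xi in x]
    return [e / (s * math.sqrt(1 - h)) for e, h in zip(resid, leverage)]

def flag_unusual(x, y, cutoff=2.0):
    """1-based observation numbers whose |standardized residual| exceeds cutoff."""
    return [i + 1 for i, r in enumerate(standardized_residuals(x, y))
            if abs(r) > cutoff]

# Made-up example: a straight line with one observation far above it.
x_demo = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y_demo = [2, 4, 6, 8, 30, 12, 14, 16, 18, 20]
print("Check the incident reports for observations:", flag_unusual(x_demo, y_demo))
```

The output is a list of observation numbers to look up in the incident reports, not a list of rows to drop.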


NC
 
stats_beginner

Re: Interpreting linear regression results

Super, thank you so much for your help.

NumberCruncher, I'm delighted to hear you're surprised at the goodness-of-fit. This is a first attempt at developing some sort of model to predict the number of injuries which might be expected following a building collapse, something it's quite difficult to get data for. I'm hoping to collect more information and eventually account for differences in the types of building, the severity of injuries observed, etc. As a general first guess the regression equation makes sense; my concern was whether the data was appropriate to allow me to use this equation.
 
bbarbee

Re: Interpreting linear regression results

My 2c is this: I'm not sure the X and Y are correctly stated. Don't injuries cause fatalities? I'm not a doctor, but...

And deleting data in search of normality is generally A Bad Thing to do unless you have a really good reason to believe that the particular points you don't like really don't belong in that model. (Maybe there was a convention of hardy yet injury prone people in the building, or perhaps they were all very old, etc)

More data will certainly help, but you can't knock down buildings to get it. The sqrt transform certainly makes the R^2 better, but you don't want to over-fit your data, either.
 
NumberCruncher

Re: Interpreting linear regression results

"My 2c is this: I'm not sure the X and Y are correctly stated. Don't injuries cause fatalities? I'm not a doctor, but..."

Hi bbarbee.

Sorry to contradict you but from a reporting point of view, there is no reason why injuries should cause fatalities. If you are injured and live you are recorded as injured. If you are injured and subsequently die, you are removed from the injured category and placed in the fatalities category.

It's safe to assume that just about everyone who dies in an incident, dies as a result of injuries, excepting the one person who died of an entirely unrelated heart attack at the exact moment of building collapse.

Let's take a simple numerical example.

There are 10 people involved in an incident. 10 are injured and 3 die. So that's 13 people then.

???

I stated in my previous post:

"No, the fit isn't perfect, but you have a very messy data set (sorry about the rather sick pun, it wasn't intentional, honest!). The assumption behind a model is that one variable can be used to generate another. In the hard sciences, this is a very good assumption. The experiments can be controlled and the physical model assumes cause and effect.

With happenstance data such as this, there is no physical reason why 4 people being killed should cause 14 people to be injured. It's actually the incident that causes both, but the mathematical model ignores that."


In physical reality, the building collapse causes both the injuries and the fatalities. The plot is simply the two dimensional projection of a three dimensional data set. The third dimension is something that I will call "Seriousness of building collapse". I have absolutely no idea what that is or how you measure it.
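A small simulation can illustrate the "hidden third dimension" idea. Everything here is invented for the purpose: `severity` is an arbitrary made-up score, the coefficients are assumptions, and the simulated values are not real counts. The helper names are hypothetical too.

```python
import random

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def residuals_on(z, v):
    """Residuals of v after fitting a least-squares line on z."""
    n = len(z)
    mz, mv = sum(z) / n, sum(v) / n
    slope = (sum((a - mz) * (b - mv) for a, b in zip(z, v))
             / sum((a - mz) ** 2 for a in z))
    return [b - (mv + slope * (a - mz)) for a, b in zip(z, v)]

rng = random.Random(42)  # fixed seed so the run is repeatable
severity = [rng.uniform(1, 10) for _ in range(500)]       # hidden driver
fatalities = [2 * s + rng.gauss(0, 1) for s in severity]  # depends only on severity
injuries = [5 * s + rng.gauss(0, 2) for s in severity]    # depends only on severity

raw_corr = pearson(fatalities, injuries)
# Correlation left over once severity has been accounted for in both variables.
partial_corr = pearson(residuals_on(severity, fatalities),
                       residuals_on(severity, injuries))
print(f"raw correlation = {raw_corr:.2f}, after removing severity = {partial_corr:.2f}")
```

In this setup fatalities and injuries never influence each other, yet they correlate strongly because both follow severity. Once severity is removed, little correlation remains. The regression in this thread can still be useful for prediction, but on its own it says nothing about cause.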

Yes, you can have overlapping categories. The classic example is infant mortality and child mortality. Child mortality (under 5 years) includes infant mortality (under 1 year).

This is a different point to the one above which is about correlation and cause.

NC
 

Jen Kirley

Quality and Auditing Expert
Leader
Admin
My mind was going, I think, to the same place as bbarbee. :agree1:

I am wondering what is the relationship between x and y. In regression analysis we are looking for a trended effect of something, or trying to predict something given data we know. Are you using regression analysis for predicting fatalities based on injuries?

I have found a site from Princeton called Interpreting Regression Output.

I hope this helps!
 