stats_beginner
Hi,
I am trying to apply linear regression to a limited data set relating the number of injuries to the number of fatalities in building collapse. There are further variables (e.g. building type, extent of collapse etc) that will influence this relationship, but due to a lack of information I am not in a position to account for these yet.
I have performed a linear regression using Minitab on the full data set and got the following relationship (the Minitab results and related plots are in Regression I.pdf attached):
Injuries = 14.6 + 1.20 Fatalities
However, the residual plots don't appear to meet the normality assumption (I think), so I tried repeating the regression with observations 4 and 11 omitted (both were identified as having large standardized residuals). This resulted in the following adjusted equation (with Minitab results and related plots in Regression II.pdf attached):
Injuries - adjusted = 5.2 + 1.50 Fatalities - adjusted
Now the residual plots appear to be more 'normal', but two observations are still highlighted as having large standardized residuals. Omitting observations 4 and 11 also increased the R-sq value from 67.1% to 71.6%.
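In case anyone wants to reproduce this check outside Minitab, here is a minimal sketch of how the fit, R-sq and standardized residuals can be computed with NumPy. The numbers in it are made up purely for illustration (they are not the data in the attached PDFs), the function name `fit_line` is my own, and the |standardized residual| > 2 cutoff is a common rule of thumb that I am assuming is close to what Minitab flags.

```python
import numpy as np

def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    X = np.column_stack([np.ones_like(x), x])      # design matrix [1, x]
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    n, p = X.shape
    s2 = resid @ resid / (n - p)                   # residual variance estimate
    # Leverage h_ii: diagonal of the hat matrix X (X'X)^-1 X'
    h = np.einsum("ij,jk,ik->i", X, np.linalg.inv(X.T @ X), X)
    std_resid = resid / np.sqrt(s2 * (1.0 - h))    # standardized residuals
    r2 = 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)
    return {"intercept": beta[0], "slope": beta[1], "r2": r2,
            "leverage": h, "std_resid": std_resid}

# Hypothetical data, NOT the real building-collapse observations.
fatalities = [2, 60, 5, 75, 8, 10, 3, 12, 6, 15, 9, 4]
injuries = [10, 80, 25, 120, 20, 30, 12, 28, 22, 40, 70, 15]

fit = fit_line(fatalities, injuries)
print(f"Injuries = {fit['intercept']:.1f} + {fit['slope']:.2f} Fatalities")
print(f"R-sq = {100 * fit['r2']:.1f}%")

# Observation numbers (1-based) whose standardized residual exceeds 2 in size.
flagged = np.flatnonzero(np.abs(fit["std_resid"]) > 2) + 1
print("Large standardized residuals at observations:", flagged.tolist())
```

Rerunning `fit_line` after dropping the flagged rows shows how much the intercept, slope and R-sq move when those observations are left out.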
Then, as a separate attempt to improve the results, I tried deleting observations 2 and 4 (both had a much larger number of fatalities than the rest of the data). This resulted in the following adjusted equation (with Minitab results and related plots in Regression III.pdf attached):
Injuries - adjusted = 9.3 + 1.4 Fatalities - adjusted
In this case the residual plots look better than for case I but not as good as for case II, while the R-sq value is reduced to 50.2% (the lowest of the three values).
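For the observations with far more fatalities than the rest, a related check is leverage, which measures how far an observation's x value sits from the mean of x. Below is a rough sketch, again on made-up numbers rather than my real data; the helper names `leverage` and `refit_without` are hypothetical, and the 3p/n cutoff is one commonly quoted rule of thumb, not necessarily what Minitab applies.

```python
import numpy as np

def leverage(x):
    """Leverage for simple linear regression: 1/n + (x - mean)^2 / Sxx."""
    x = np.asarray(x, dtype=float)
    dx = x - x.mean()
    return 1.0 / x.size + dx ** 2 / np.sum(dx ** 2)

def refit_without(x, y, drop):
    """Refit the line after dropping observations (1-based numbers in `drop`)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    keep = np.ones(x.size, dtype=bool)
    keep[np.asarray(drop, dtype=int) - 1] = False
    slope, intercept = np.polyfit(x[keep], y[keep], 1)
    r = np.corrcoef(x[keep], y[keep])[0, 1]
    return intercept, slope, r ** 2    # in simple regression R-sq = r^2

# Hypothetical data, NOT the real building-collapse observations.
fatalities = [2, 60, 5, 75, 8, 10, 3, 12, 6, 15, 9, 4]
injuries = [10, 80, 25, 120, 20, 30, 12, 28, 22, 40, 70, 15]

h = leverage(fatalities)
cutoff = 3 * 2 / len(fatalities)       # 3p/n with p = 2 coefficients
print("High-leverage observations:", (np.flatnonzero(h > cutoff) + 1).tolist())

for drop in ([], [4, 11], [2, 4]):
    b0, b1, r2 = refit_without(fatalities, injuries, drop)
    print(f"drop {drop}: Injuries = {b0:.1f} + {b1:.2f} Fatalities, R-sq = {100 * r2:.1f}%")
```

One thing this makes visible is that removing the points with the largest fatality counts narrows the spread of the x values, and that by itself tends to lower R-sq even if the underlying relationship is unchanged, so the R-sq values of the three fits are not directly comparable.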
From this I would consider regression II to be the best, but I am wondering whether I am missing something or interpreting the results incorrectly.
I know the data isn't great but I want to have an approximate relationship which I can improve as more data becomes available.
I'd appreciate any advice on this as my stats experience is extremely limited!
Thanks in advance,
V