Should Data be Normal before Computing Baselines?


patkim

When computing performance baselines, after the data are collected, is it recommended to remove the outliers, or should the baseline be computed with the outliers included?

What is the recommended approach? Could someone please elaborate?

For example, say a software development company collects Effort Variance from 40 milestones. The data are not normal because a few outliers, say 4 data points, are badly skewed. Should the baseline be computed with all 40 data points, or should I remove the outliers?

Thanks in advance.
 

Bev D

Heretical Statistician
Leader
Super Moderator
Don't test for Normality and don't remove so-called 'outliers'.
The only reason to remove a data point is that it is actually a fake value.


It might be helpful if you could explain what you are trying to do with the baseline...
For example - if you are trying to improve the deviation from milestone timelines then removing the 'skewed values' (I assume they are 'large deviations?) from the baseline is removing the very poor performance that you seek to improve. Focus on the causes of the deviation, improve the performance and plot the after performance values vs the before and see if the improvements were good enough. Fancy statistical calculations do not help you do this...
 

bobdoering

Stop X-bar/R Madness!!
Trusted Information Resource
Your baseline should be computed while your process is stable. It should only be expected to be normal when stable if the output variation is SUPPOSED to be normal, that is, the variation is expected to be random and independent, with its mean centered within its variation. If not, then the baseline needs to reflect the distribution you would expect in its stable condition (e.g., a non-normal distribution such as skewed or uniform).
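
As a quick illustration of that point (a simulation, not data from this thread): a perfectly stable process whose output is supposed to be skewed will still "fail" a normality test, which says nothing about whether the baseline is valid.

```python
# Sketch under assumptions: a stable but right-skewed process (simulated here
# as lognormal) fails a normality test even though nothing is wrong with it.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
effort_variance = rng.lognormal(mean=-1.5, sigma=0.6, size=40)  # stable, but skewed by design

stat, p = stats.shapiro(effort_variance)
print(f"Shapiro-Wilk p-value: {p:.4f}")  # typically small: 'not normal', yet the process is stable
```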
 

Bev D

Heretical Statistician
Leader
Super Moderator
A baseline 'should' be stable for the calculation of capability indices (ugh!)

A baseline only needs to be relatively stable in order to calculate control limits for a control chart. We can still calculate 'good limits from bad data' if we understand the process and how control charts work. :) In this case we would exclude a few extreme data points from the control limit calculations but would keep those points on the chart.
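
A sketch of what that looks like in practice, with hypothetical data and a hypothetical rule for what counts as 'extreme' (the limits come from the remaining points, but every point is still plotted):

```python
# Sketch only: an individuals (XmR) chart where a few extreme points are
# excluded from the limit calculations but remain visible on the chart.
import numpy as np
import matplotlib.pyplot as plt

values = np.array([0.12, 0.15, 0.10, 0.14, 0.95, 0.13, 0.11, 0.16,
                   0.12, 0.88, 0.14, 0.15, 0.13, 0.10, 0.12, 0.14])

exclude = values > 0.5                       # hypothetical rule for 'extreme' points
baseline = values[~exclude]                  # compute limits from the remaining data only

moving_range = np.abs(np.diff(baseline))
center = baseline.mean()
sigma_hat = moving_range.mean() / 1.128      # d2 constant for moving ranges of size 2
ucl = center + 3 * sigma_hat
lcl = center - 3 * sigma_hat

plt.plot(values, "o-")                       # plot ALL points, including the extremes
plt.axhline(center, color="green")
plt.axhline(ucl, color="red", linestyle="--")
plt.axhline(lcl, color="red", linestyle="--")
plt.xlabel("observation")
plt.ylabel("effort variance")
plt.show()
```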

If the OP is determining a baseline for other reasons (like understanding the actual process performance for budgeting or problem solving, etc.), the process doesn't have to be stable at all. And if this is the intent, censoring the data is horribly misguided and counterproductive.

Perhaps the OP could provide more information on what they are trying to do?
 

bobdoering

Stop X-bar/R Madness!!
Trusted Information Resource
The key point I was trying to make - going back to the OP's question - is that 'normal' may have absolutely nothing to do with it. Just because you see a normal distribution does not mean it reflects the process. It may be all gage or measurement error - which tends to generate normal distributions that mask the true underlying process variation.

The distribution that models your current state may also be totally different from the distribution in its optimum state. And neither may be normal. Curve fit and find a meaningful model.
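
A rough sketch of that curve-fitting step (simulated data and an arbitrary set of candidate distributions; the KS p-values are only a screening aid here, since the parameters are estimated from the same data):

```python
# Sketch only: compare a few candidate distributions for the observed data
# instead of assuming normality. Data and candidates are illustrative.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
data = rng.gamma(shape=2.0, scale=0.1, size=40)   # stand-in for effort-variance data

candidates = {
    "normal": stats.norm,
    "lognormal": stats.lognorm,
    "gamma": stats.gamma,
    "uniform": stats.uniform,
}

for name, dist in candidates.items():
    params = dist.fit(data)                               # maximum-likelihood fit
    ks_stat, p = stats.kstest(data, dist.cdf, args=params)
    print(f"{name:10s}  KS statistic = {ks_stat:.3f}  p = {p:.3f}")
```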
 