Large data sets of continuous individual data - Estimated or actual deviation


pavelstat2019

Hi,

If we have large data sets of continuous individual data, e.g. more than 500 values, should we still use the estimated standard deviation, or can we use the normal (overall) standard deviation over the whole data set to calculate the UCL and LCL?

The UCL / LCL based on the estimated deviation (CL +/- 2.66 * average moving range) are in our case more conservative (tighter) than limits based on the actual deviation over the general population.

Thanks.

Bev D

Heretical Statistician
You have posted this same question in an old thread - here is my response to that:

The phrase 'estimated standard deviation' is a misnomer. You must still calculate the control limits using the within-subgroup standard deviation. This is how SPC works.

If the process is homogeneous, the within-subgroup standard deviation adjusted by d2 or c4 will be a good estimate of the population standard deviation (hence the term 'estimate', but using that term instead of within-subgroup SD leads to many misinterpretations). IF the process is homogeneous, the subgroup averages will vary within the grand average +/- 3 * (within-subgroup SD / c4) / sqrt(n). If the process is not homogeneous, the averages will not comply with the calculated limits. That is how SPC works. Trying to calculate the population SD directly sidesteps this critical part of the process.
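For illustration, a minimal sketch of computing X-bar limits from the within-subgroup standard deviation, assuming equal subgroups of size 2 to 5 (the function and variable names here are illustrative only):

import numpy as np

def xbar_limits(subgroups):
    # subgroups: 2-D array-like, one row per rational subgroup
    x = np.asarray(subgroups, dtype=float)
    n = x.shape[1]
    c4 = {2: 0.7979, 3: 0.8862, 4: 0.9213, 5: 0.9400}[n]  # bias-correction constant for subgroup size n
    s_bar = x.std(axis=1, ddof=1).mean()   # average within-subgroup standard deviation
    sigma_within = s_bar / c4              # estimate of process sigma from within-subgroup variation
    grand_mean = x.mean()
    half_width = 3.0 * sigma_within / np.sqrt(n)
    return grand_mean - half_width, grand_mean + half_width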

I am curious, though: what do you mean by having large data sets with n > 500? Are the data sets your 'subgroups', or are they for different processes? Can you describe what you are trying to do?

pavelstat2019

We are developing a helpdesk dashboard based on an analytical system that can easily process millions of records in seconds using any custom logic, and the dashboard visualization is advanced enough that 300 days can be shown visually with no problem.

We have a number of metrics like "number of incidents" with historical data for 300 days. We could group those values by week / month, but that would rather create a problem - the day view is much better for operational purposes. "Number of incidents" obviously also exists at minute / hour granularity, but at that scale the metric does not mean anything.

We want to create a control chart that, in live mode, shows day by day whether the helpdesk is in control over the last 90 days, to signal executives whether their reaction is required. Previously, there were situations in which information about a "disaster" did not reach the higher level in time.

We analyzed the historical data (300 days) using the standard formulas with CL = average and UCL / LCL = CL +/- 2.66 * MR-bar, and found several periodic peaks due to some mismanagement causes. However, the standard deviation calculated using the normal formula SQRT( SUM( (x - AVG(x))^2 ) / (N - 1) ) is slightly different from (average moving range) / d2, i.e. MR-bar / d2 with d2 = 1.128 for a moving range of span 2. This difference is sufficient to exclude some of the peaks.
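For reference, a minimal sketch of the two estimates being compared, assuming daily_counts holds the 300 daily values (the names are illustrative):

import numpy as np

def sigma_estimates(daily_counts):
    x = np.asarray(daily_counts, dtype=float)
    mr_bar = np.abs(np.diff(x)).mean()   # average moving range of span 2
    sigma_within = mr_bar / 1.128        # MR-bar / d2, with d2 = 1.128 for n = 2
    sigma_overall = x.std(ddof=1)        # overall sample standard deviation
    return sigma_within, sigma_overall

When the series contains shifts or periodic peaks, sigma_overall tends to be inflated relative to sigma_within, which is why the MR-based limits come out tighter.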

So, the questions we have:
1. Can we use the normal standard deviation for the historical data and the estimated one for the 90-day view?
2. Should we exclude from the analysis the peaks that do not breach the UCL/LCL based on the normal deviation formula SQRT( SUM( (x - AVG(x))^2 ) / (N - 1) )?
3. Going forward, shall we use UCL / LCL = CL +/- 2.66 * MR-bar, or can we use UCL / LCL = CL +/- 3 * SQRT( SUM( (x - AVG(x))^2 ) / (N - 1) ) for the 90-day view?
To give some idea of the numbers:
CL = 67, MR-bar = 23

UCL1 = CL + 2.66 * MR-bar = 129, estimated standard deviation = 21
UCL2 = CL + 3 * SQRT( SUM( (x - AVG(x))^2 ) / (N - 1) ) = 157, actual standard deviation = 30
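As a quick sanity check of these figures (rounded inputs, so small differences from the posted values are expected):

cl, mr_bar, s_overall = 67.0, 23.0, 30.0
sigma_est = mr_bar / 1.128     # ~20.4, consistent with the ~21 quoted above
ucl1 = cl + 2.66 * mr_bar      # ~128.2 with the rounded MR-bar of 23
ucl2 = cl + 3.0 * s_overall    # 157.0
print(round(sigma_est, 1), round(ucl1, 1), round(ucl2, 1))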