# What is the minimum Sample Size for Weibull Analysis?

#### debun

I experimented with using only the first 5 data points from a sample set of 26 to do a Weibull analysis. The objective was to see how well these first 5 points predicted the distribution of the entire set. The first 5 points fit a 2-parameter Weibull and gave an R^2 value of 0.99. The second data set (the remaining 21 points) is better fit by a 3-parameter Weibull, with an R^2 value somewhere in the neighborhood of 0.976. So my questions are:

1) How does one determine the minimum sample size to use for a Weibull analysis? There are equations for calculating the sample size for a mean, e.g. n = Z^2 * sigma^2 / sampling_error^2. This seems to work well when the standard deviation is small, but with failure data the standard deviation is typically large and the data are sometimes not normally distributed. That leads to huge sample sizes with this method. Should I use a proportions method instead? The sample size is still large.

2) I would like to plot the remaining 21 data points over the CDF calculated from my first 5 points. The median ranking has a huge effect on this, so what is the correct way to do it? I have tried 2 methods: the first uses the beta cumulative distribution function (the first step in calculating the MR for any data set), and the second uses the regression values for the set of 21 points (used to calculate the MR). The beta method puts the data closest to the 5-point CDF, but I intuitively think the regression method makes the most sense. Is there a way to do this?

Here is my data set.
154173
171158
83431
201778
117578

192083
136262
149487
148009
98317
69798
94195
62548
103574
108364
132377
143047
85272
95760
214166
289237
161265
172490
99972
117440
89717
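
For reference, the mean-based formula in question 1 can be evaluated directly on the posted data. This is only a minimal sketch: the 1.96 multiplier (roughly 95% confidence), the margins of 10%, 5% and 1% of the mean, and the helper name `sample_size_for_mean` are assumptions chosen for illustration, not values from the post.

```python
import math
import statistics

# Cycles to failure as posted above (first 5 points, then the remaining 21).
data = [
    154173, 171158, 83431, 201778, 117578,
    192083, 136262, 149487, 148009, 98317, 69798, 94195, 62548,
    103574, 108364, 132377, 143047, 85272, 95760, 214166, 289237,
    161265, 172490, 99972, 117440, 89717,
]


def sample_size_for_mean(sigma, margin, z=1.96):
    """n = Z^2 * sigma^2 / E^2, rounded up to a whole unit."""
    return math.ceil((z * sigma / margin) ** 2)


mean = statistics.mean(data)
sigma = statistics.stdev(data)  # sample standard deviation of all 26 points

# Assumed targets: estimate the mean to within 10%, 5% and 1% of its value.
for pct in (0.10, 0.05, 0.01):
    n = sample_size_for_mean(sigma, pct * mean)
    print(f"margin = {pct:.0%} of mean -> n = {n}")
```

Because n grows with the square of sigma / margin, halving the margin quadruples the required sample, which is why a large standard deviation pushes this formula to very large n.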

#### Steve Prevette

##### Deming Disciple
Staff member
Super Moderator
There is a reasonably good writeup on Weibull on Wikipedia at http://en.wikipedia.org/wiki/Weibull_distribution

We've previously had discussions on the Cove about fitting the Weibull distribution to data. Unfortunately, the parameters need to be solved for by optimization, which can be done in Excel, but there is no straightforward formula.

Just as I'd get a 100% correlated fit if I had two points, plotted them, and fit a line, much the same happens with a Weibull fit to only five points.

I think you should move ahead with your simulation and get a feel for yourself how Weibull behaves. My personal judgement call would be I'd like at least a dozen points to get a fairly good fit on a Weibull.
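
One way to set up that kind of simulation, as a rough sketch: draw many samples from a known Weibull, fit each one by median rank regression, and watch how R^2 and the fitted shape behave as n changes. The "true" shape and scale (2.5 and 150,000), Benard's approximation for the median ranks, and the helper names `mrr_fit` and `simulate` are all assumptions made for illustration; none of them come from the posts above.

```python
import math
import random
import statistics


def mrr_fit(times):
    """Median rank regression on a Weibull plot; returns (beta, eta, r2)."""
    t = sorted(times)
    n = len(t)
    x = [math.log(v) for v in t]
    # Benard's approximation to the median rank of the i-th ordered failure.
    y = [math.log(-math.log(1 - (i - 0.3) / (n + 0.4))) for i in range(1, n + 1)]
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    beta = sxy / sxx                # slope of y on x is the shape
    eta = math.exp(mx - my / beta)  # from intercept = -beta * ln(eta)
    return beta, eta, sxy ** 2 / (sxx * syy)


def simulate(n, trials=2000, beta=2.5, eta=150_000, seed=1):
    """Return (mean R^2, 5th percentile of beta, 95th percentile of beta)."""
    rng = random.Random(seed)
    fits = [
        mrr_fit([rng.weibullvariate(eta, beta) for _ in range(n)])
        for _ in range(trials)
    ]
    betas = sorted(f[0] for f in fits)
    mean_r2 = statistics.fmean(f[2] for f in fits)
    return mean_r2, betas[int(0.05 * trials)], betas[int(0.95 * trials)]


for n in (5, 12, 26, 100):
    mean_r2, lo, hi = simulate(n)
    print(f"n={n:>3}  mean R^2={mean_r2:.3f}  fitted beta 5%-95%: {lo:.2f} to {hi:.2f}")
```

The thing to compare across rows is the spread of the fitted shape against the average R^2: a straight-looking probability plot on its own says little about how well the parameters are pinned down.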

#### Miner

##### Forum Moderator
Staff member
What estimation method are you using: least squares or maximum likelihood? Least squares regression (LSR) is better for small sample sizes. Maximum likelihood estimation (MLE) should ideally have 100 or more failures, and no fewer than 50.

Regarding the 3-parameter Weibull, this should not be used unless there is a physical reason why there should be a threshold parameter. That is, if you have a threshold parameter of 3 months, is there a physical reason why you CANNOT have a failure in less than 3 months? If there is no physical reason, use of the 3-parameter Weibull is risky.
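
For anyone who wants to put maximum likelihood next to least squares on the same data, here is a minimal pure-Python sketch of a 2-parameter Weibull MLE for complete (uncensored) failure data. It is not what Minitab or any other package does internally; the bisection bounds and the function name `weibull_mle` are assumptions for illustration.

```python
import math


def weibull_mle(times, lo=0.01, hi=100.0, tol=1e-9):
    """2-parameter Weibull MLE for uncensored data; returns (beta, eta)."""
    scale = max(times)
    x = [t / scale for t in times]  # rescale so x ** k cannot overflow
    mean_log = sum(math.log(v) for v in x) / len(x)

    def score(k):
        # Likelihood equation for the shape; its root is the MLE of beta.
        w = [v ** k for v in x]
        return sum(wi * math.log(v) for wi, v in zip(w, x)) / sum(w) - 1 / k - mean_log

    while hi - lo > tol:  # bisection works because score() increases with k
        mid = (lo + hi) / 2
        if score(mid) > 0:
            hi = mid
        else:
            lo = mid
    beta = (lo + hi) / 2
    eta = scale * (sum(v ** beta for v in x) / len(x)) ** (1 / beta)
    return beta, eta


first5 = [154173, 171158, 83431, 201778, 117578]
rest21 = [
    192083, 136262, 149487, 148009, 98317, 69798, 94195, 62548,
    103574, 108364, 132377, 143047, 85272, 95760, 214166, 289237,
    161265, 172490, 99972, 117440, 89717,
]

for label, sample in (("first 5", first5), ("all 26", first5 + rest21)):
    beta, eta = weibull_mle(sample)
    print(f"{label}: beta = {beta:.2f}, eta = {eta:.0f}")
```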

#### Statistical Steven

##### Statistician
Staff member
Super Moderator
Thank you! It can never be said enough that just blindly applying a distribution to data without a physical reason is very risky!

#### debun

I used least squares to do my regression. The choice of the 3-parameter Weibull was based on my R^2 value, as I stated. Are you implying that the decision should be based on some other factor or factors? The R^2 value is also the reason the first 5 points were fit with 2 parameters and the remaining 21 with 3. I am only looking at the fit of the data.

#### Miner

##### Forum Moderator
Staff member
More than implying... Distribution fitting is not easy. There is a lot of engineering judgment that goes with it. I use the following approach (in Minitab):
1. Use the probability plots to rule out obviously poor fits (i.e., curves and doglegs that may indicate mixtures of different failure modes, which must be modeled separately).
2. Review the remaining distributions to identify the higher-level distributions (e.g., 3-parameter Weibull), and determine both whether there is a physical reason for a location parameter (as I discussed previously) AND whether the fit with the additional parameter is significantly better than the base distribution. I usually eliminate the higher-level distributions based on this.
3. Review the remaining distributions with the highest equivalent fits, looking at the means and standard errors. You will typically be able to eliminate 1 or 2 more because these values are obviously very poor (usually extremely high).
4. I usually end up with 2, sometimes 3, distributions remaining that are essentially equivalent. I then look at what type of failure mode each is typically used to model and select based on that.

BTW, I split your data set and performed a test of whether the shape and scale parameters were statistically different, and the p values were > 0.4. The confidence intervals on a sample of 5 were quite wide.
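
On step 2, a small sketch of why R^2 by itself cannot justify the third parameter: if the threshold is chosen to maximize R^2 and a threshold of zero is one of the candidates, the 3-parameter R^2 can never come out lower than the 2-parameter one. The grid search, Benard's median rank approximation, and the name `r_squared` are assumptions for illustration; this is not the Minitab procedure and it is not a significance test.

```python
import math


def r_squared(times, gamma=0.0):
    """R^2 of the Weibull probability plot after subtracting a threshold gamma."""
    t = sorted(times)
    n = len(t)
    x = [math.log(v - gamma) for v in t]
    y = [math.log(-math.log(1 - (i - 0.3) / (n + 0.4))) for i in range(1, n + 1)]
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy ** 2 / (sxx * syy)


rest21 = [
    192083, 136262, 149487, 148009, 98317, 69798, 94195, 62548,
    103574, 108364, 132377, 143047, 85272, 95760, 214166, 289237,
    161265, 172490, 99972, 117440, 89717,
]

# Candidate thresholds from 0 up to just below the earliest failure.
t_min = min(rest21)
grid = [t_min * i / 200 for i in range(200)]  # gamma = 0 is always a candidate
best_gamma = max(grid, key=lambda g: r_squared(rest21, g))

print(f"2-parameter R^2 (gamma = 0): {r_squared(rest21):.4f}")
print(f"best gamma on the grid: {best_gamma:.0f}")
print(f"3-parameter R^2: {r_squared(rest21, best_gamma):.4f}")
```

Any gain in R^2 therefore has to be weighed against the physical question of whether failures before the fitted threshold are really impossible.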

#### JayWarner

The 'answers' and suggestions for this question are very, very good, especially: include the engineering aspects in the analysis decisions. I would only add, "Plot the *&^\$ data!" If the data clearly don't fit the assessed curve well, something is wrong. Weibull analysis can be prone to this problem. As for comparing some of the data against the rest, I'd suggest that you take a structured sample: every second or third observation in one group vs. the others. Taking the first 5 against the others leaves you wondering if perhaps the engineers and techs who did the work worked out a few kinks in those first measurements, so that they don't really reflect the defined population of the later data.
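
A structured split like that takes only a couple of lines. This sketch assumes the values are listed in the order they were measured, which the thread does not actually confirm, and the choice of every third point is arbitrary.

```python
# Data in the order posted (assumed here to be the order of testing).
data = [
    154173, 171158, 83431, 201778, 117578,
    192083, 136262, 149487, 148009, 98317, 69798, 94195, 62548,
    103574, 108364, 132377, 143047, 85272, 95760, 214166, 289237,
    161265, 172490, 99972, 117440, 89717,
]

every_third = data[::3]                                    # 1st, 4th, 7th, ...
others = [v for i, v in enumerate(data) if i % 3 != 0]     # everything else

print(len(every_third), "points in the structured sample")
print(len(others), "points in the comparison group")
```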

#### debun

"Plot the *&^\$ data!" Where do you think I got the R^2 values from?

#### debun

I'm not struggling with which distribution to use. My question is still how to determine the sample size, since if you look at the first 5 points alone the fit is good. And how do you compare the first 5 to the remaining 21, given that you have to determine the median ranking?
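
To make the comparison concrete, here is one possible way to line the two sets up, offered as a sketch and not as the correct answer to the question: fit the first 5 points by median rank regression, then give each of the remaining 21 points its own median rank (ranked within the 21) and compare that with the CDF predicted by the 5-point fit at the same cycle count. Benard's approximation is used here in place of the exact beta-based median rank, and the helper names are hypothetical.

```python
import math

first5 = [154173, 171158, 83431, 201778, 117578]
rest21 = [
    192083, 136262, 149487, 148009, 98317, 69798, 94195, 62548,
    103574, 108364, 132377, 143047, 85272, 95760, 214166, 289237,
    161265, 172490, 99972, 117440, 89717,
]


def benard_ranks(n):
    """Benard's approximation to the median ranks for n ordered failures."""
    return [(i - 0.3) / (n + 0.4) for i in range(1, n + 1)]


def mrr_fit(times):
    """Least squares fit on the Weibull plot; returns (beta, eta)."""
    x = [math.log(v) for v in sorted(times)]
    y = [math.log(-math.log(1 - f)) for f in benard_ranks(len(times))]
    mx, my = sum(x) / len(x), sum(y) / len(y)
    beta = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return beta, math.exp(mx - my / beta)


def weibull_cdf(t, beta, eta):
    return 1 - math.exp(-((t / eta) ** beta))


beta5, eta5 = mrr_fit(first5)

# One row per remaining point: cycles, its own median rank, 5-point prediction.
rows = [
    (t, mr, weibull_cdf(t, beta5, eta5))
    for t, mr in zip(sorted(rest21), benard_ranks(len(rest21)))
]
for t, mr, cdf in rows:
    print(f"{t:>7}  median rank = {mr:.3f}  5-point CDF = {cdf:.3f}")
print("largest gap:", round(max(abs(mr - cdf) for _, mr, cdf in rows), 3))
```

Plotting the second column against cycles as points and the third as a curve gives the overlay described in the question; the largest gap is just a crude summary of how far apart they are.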

#### debun

What is the minimum sample size to generate a Weibull distribution?

One of the advantages of the Weibull is that you can form a distribution with a much smaller sample size than, say, a histogram.
I experimented with using only the first 5 data points from a sample set of 26 to do a Weibull analysis. The first 5 points fit a 2-parameter Weibull and gave an R^2 value of 0.99. The second data set (the remaining 21 points) is better fit by a 3-parameter Weibull, with an R^2 value somewhere in the neighborhood of 0.976. So my questions are:

1) What is the minimum sample size you need to accurately represent the population? For example, from a sample of size X, the PDF is within some metric (standard deviations, %, etc.) of the true PDF. Clearly the R^2 value alone isn't a good metric.

2) I would like to plot the remaining 21 data points over the CDF calculated from my first 5 points. I have used 2 methods: the first simply calculates the MR and plots test cycles vs. MR; the second uses beta and eta to calculate the CDF percentile and plots test cycles vs. CDF percentile. Is there a good way to represent predicted CDF vs. test data?
