Determination of Sample Size
Sample size determination is based on the comparison of the accuracy and precision of the two methods4 and is similar to that for testing hypotheses about average differences in the former case and variance ratios in the latter case, but the meaning of some of the input is different. The first component to be specified is δ, the largest acceptable difference between the two methods that, if achieved, still leads to the conclusion of equivalence. That is, if the two methods differ by no more than δ, they are considered acceptably similar. The comparison can be two-sided, as just expressed, considering a difference of δ in either direction, as would be used when comparing means. Alternatively, it can be one-sided, as in the case of comparing variances, where a decrease in variability is acceptable and equivalency is concluded if the ratio of the variances (new/current, as a proportion) is not more than 1.0 + δ. A researcher will need to state δ based on knowledge of the current method and/or its use, or it may be calculated. One consideration, when there are specifications to satisfy, is that the new method should not differ so much from the current method as to risk generating out-of-specification results. One then chooses δ to have a low likelihood of this happening by, for example, comparing the distribution of data for the current method to the specification limits. This could be done graphically or by using a tolerance interval, an example of which is given in Appendix E. In general, the choice of δ must depend on the scientific requirements of the laboratory.
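The tolerance-interval approach to informing the choice of δ can be sketched in code. The following is an illustrative example only (it is not the calculation in Appendix E): it computes an approximate two-sided normal tolerance interval for hypothetical current-method data using Howe's method, with a Wilson–Hilferty approximation for the chi-square quantile so that only the Python standard library is needed. The data values and specification limits in the comments are invented for illustration.

```python
import math
from statistics import NormalDist

def chi2_ppf(p, df):
    """Wilson-Hilferty approximation to the chi-square quantile function."""
    z = NormalDist().inv_cdf(p)
    return df * (1.0 - 2.0 / (9.0 * df) + z * math.sqrt(2.0 / (9.0 * df))) ** 3

def tolerance_factor(n, coverage=0.99, confidence=0.95):
    """Approximate two-sided normal tolerance factor k (Howe's method):
    the interval mean +/- k*sd covers `coverage` of the population
    with the stated confidence."""
    z = NormalDist().inv_cdf((1.0 + coverage) / 2.0)
    chi2 = chi2_ppf(1.0 - confidence, n - 1)
    return z * math.sqrt((n - 1) * (1.0 + 1.0 / n) / chi2)

# Hypothetical current-method results (e.g., assay, % label claim)
data = [99.2, 100.1, 99.8, 100.4, 99.5, 100.0, 99.7, 100.3, 99.9, 100.2]
n = len(data)
mean = sum(data) / n
sd = math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))

k = tolerance_factor(n)
low, high = mean - k * sd, mean + k * sd
# Comparing (low, high) to specification limits, e.g. 95.0-105.0,
# indicates how much room remains for a shift of size delta in the
# new method before out-of-specification results become likely.
```

The gap between the tolerance limits and the specification limits gives a data-driven upper bound on how large a δ the laboratory can tolerate.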
The next two components relate to the probability of error. The data could lead to a conclusion of similarity when the methods are unacceptably different (as defined by δ). This is called a false positive or Type I error. The error could also be in the other direction; that is, the methods could be similar, but the data do not permit that conclusion. This is a false negative or Type II error. With statistical methods, it is not possible to completely eliminate the possibility of either error. However, by choosing the sample size appropriately, the probability of each of these errors can be made acceptably small. The acceptable maximum probability of a Type I error is commonly denoted as α and is commonly taken as 5%, but may be chosen differently. The desired maximum probability of a Type II error is commonly denoted by β. Often, β is specified indirectly by choosing a desired level of 1 − β, which is called the "power" of the test. In the context of equivalency testing, power is the probability of correctly concluding that two methods are equivalent. Power is commonly taken to be 80% or 90% (corresponding to a β of 20% or 10%), though other values may be chosen. The protocol for the experiment should specify δ, α, and power. The sample size will depend on all of these components. An example is given in Appendix E. Although Appendix E determines only a single value, it is often useful to determine a table of sample sizes corresponding to different choices of δ, α, and power. Such a table often allows for a more informed choice of sample size to better balance the competing priorities of resources and risks (false negative and false positive conclusions).
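One common way these components combine is the normal-approximation sample-size formula for a two-sided equivalence (two one-sided tests, TOST) comparison of means, assuming the true difference between methods is zero. The sketch below uses that formula and then prints the kind of table of candidate sample sizes described above; the value σ = 1.0 and the grids of δ, α, and power are illustrative assumptions, not values from Appendix E.

```python
import math
from statistics import NormalDist

def n_per_group(delta, sigma, alpha=0.05, power=0.80):
    """Approximate n per method for a two-sided equivalence (TOST)
    comparison of means, assuming the true difference is zero:
        n = 2 * (sigma/delta)^2 * (z_{1-alpha} + z_{1-beta/2})^2
    rounded up to the next integer."""
    z = NormalDist().inv_cdf
    n = 2.0 * (sigma / delta) ** 2 * (z(1.0 - alpha) + z((1.0 + power) / 2.0)) ** 2
    return math.ceil(n)

# A table over candidate choices of delta, alpha, and power,
# here with an assumed method standard deviation sigma = 1.0.
print(f"{'delta':>6} {'alpha':>6} {'power':>6} {'n':>4}")
for delta in (0.5, 1.0):
    for alpha in (0.05, 0.10):
        for power in (0.80, 0.90):
            n = n_per_group(delta, sigma=1.0, alpha=alpha, power=power)
            print(f"{delta:6.2f} {alpha:6.2f} {power:6.2f} {n:4d}")
```

Scanning such a table makes the trade-offs explicit: halving δ quadruples the required n, while raising power from 80% to 90% increases it more modestly.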