Introduction – historical genesis of difference testing
Chance is part of nature, and whatever we do there is always a probabilistic aspect we must deal with. This was utterly clear to researchers at the beginning of the 20th century who were involved in the positivistic approach to the natural sciences, especially agriculture (1). In this field, they had the opportunity to manipulate nature to produce changes, but at the same time they needed to recognize whether what they observed was the consequence of an intervention or just a mere fluke (2). Indeed, they already knew that nature was variable, and Galton had shown that biological traits followed the laws of probability (3, 4). Therefore, researchers needed to turn observations into evidence, and the laws of probability gave them the means to assess the uncertainty in their findings.
It is no surprise that Ronald Fisher, at the time he devised hypothesis testing, was employed at an experimental agricultural station and developed a pragmatic view of statistics aimed at ruling out chance from empirical evidence (5, 6). In his framework, Fisher conceived experimental research as a process attempting to induce, through the application of a factor, a certain difference in the observations taken with respect to an untreated condition*. (*NOTE: We would like to remark that this narrative description of Fisher’s achievements is intentionally oversimplified. For instance, we omitted to mention that the “observed differences” were actually “average differences between observations”, through which it was possible to apply Student’s method of statistical comparison using probability distributions.) However, what he cleverly did was to approach the proof of such an experimental factor from the opposite perspective of no experimental effect. Although it might seem paradoxical, positing a hypothesis of null effect made it possible to represent observed differences as erratic fluctuations produced by chance around a hypothetical value of no difference equal to zero (5). Of course, this explicitly recalled the probabilistic description of measurement errors that Gauss had given about a century before and that was already familiar to researchers: the larger the difference from the expected outcome, the lower the probability of its random realization (7). Thus, the observation of a large difference with an associated low probability was unlikely to be due to a random fluctuation, in turn disproving the null hypothesis. For practical reasons of experimental reproducibility, Fisher set the probability threshold for disproving randomness as low as < 0.05 (less than 1 out of 20 trials), aiming to assure enough confidence when stating the alternative hypothesis of an experimental factor (5).
Contemporaneously with Fisher’s efforts, Egon Pearson and above all Jerzy Neyman refined the concepts of hypothesis testing (5). Neyman, who was concerned with mathematics and logic more than with experiments, formally showed that the rejection of a null hypothesis could be achieved only at the expense of a certain uncertainty regarding its truth. Such uncertainty corresponded to the probability of erroneously rejecting a null hypothesis when there was no effect (which he called Type I error), and equated Fisher’s threshold for achieving significance (or α) (Table 1). Notably, Neyman also showed that the gain in confidence when correctly rejecting a null hypothesis always came at the expense of the sensitivity in detecting an actual effect, a concept he termed statistical power. Indeed, he showed that lowering α inflates the probability of accepting a null hypothesis when there actually is an effect (which he called Type II error or β, so that sensitivity, or power, equals 1 − β) (Table 1).
Table 1
Neyman’s “decision making” approach is usually seen as an alternative to Fisher’s framework, although the two should rather be considered complementary (8). Indeed, Neyman’s approach helps to demonstrate that hypothesis testing is, at its core, difference testing: uncertainty is used to show whether a factor is strong enough to prove itself against chance. Thus, in the “classical framework” the burden of proof rests on difference.
A trick of logic
So far, we have seen that hypothesis testing was devised to prove an experimental hypothesis concerning a treatment or factor. Now, let us imagine that a researcher was concerned with the simple issue of replacing an old instrument with a new one. He would measure the same set of items using the old device and the new one, and then he would statistically compare the data. Could the investigator conclude that the instruments were equal if he found no statistically significant difference between the compared items? Formally, he could not state this at all. In fact, the researcher did observe some differences between the two sets of measures, but these were random, producing no significant change or bias. In other words, no researcher could recognize an instrument by the set of measures it produced, and vice versa. Therefore, he could only say that the two devices “agreed” in measuring the same objects, although he could not conclude they were literally “equal”.
However, in some situations it is necessary to state explicitly that two compared items do not simply agree, but are equivalent. The reader should notice that at this point we used the term “equivalent” instead of “equal”, as the latter represents something practically unachievable due to the randomness that arises in any natural process. This is a fundamental point and must be carefully considered, in that “equivalent” means something that, although not strictly the same, is formally alike. Let us imagine that we compared several individuals using anthropometric measures. Supposing that they were all related, for instance brothers and cousins, there would reasonably be no statistically significant difference between them. However, an individual would be more like his brother than his cousin, so the issue would be establishing the amount of difference in measures that makes any two individuals look substantially alike. Hence, to show “equivalence” between individuals, their differences should be less than a certain amount considered a mark of significant dissimilarity. Thus, the issue is twofold: first, we should accept that failing to show a difference does not imply equivalence, and second, we should answer the question of how close is close enough to be considered (practically) equivalent. It is evident that answering both requires a trick of logic to bend Fisher and Neyman’s framework to the needs of equivalence testing.
The rise of equivalence testing
The theoretical discussion of a possible testing procedure for equivalence is not new in statistics and was implicitly addressed by Erich Lehmann as early as 1959 (9). However, it was only around the 1970s that the work of Wilfred Westlake on the statistical assessment of equivalent formulations of drugs made it a true practical concern (10-13). He first recognized that any sufficiently large cohort of subjects would show even a negligible difference as statistically significant, an issue of excessive study sensitivity known as statistical overpowering. To control the β level (Table 1), Westlake devised a method relying on an interval of maximum acceptable difference between the average responses of the compared drugs (defined as –Δ; +Δ), within which the actual difference (indicated as δ) of equivalent drugs could lie. In other words, the method controlled the study sensitivity by setting the largest effect size at which the actual difference was considered negligible. Hence, although not producing strictly the same response, the two drugs were considered practically interchangeable, and thus equivalent. Westlake’s method concluded equivalence if the confidence interval around δ (a probabilistic measure of the true location of δ) rested entirely within –Δ; +Δ. Notably, such an approach suffered from inflation of the actual probability level (the P value) by which equivalence was ruled out, which flawed the application of Westlake’s method and its analogues in real decision making on bioequivalence (14, 15).
David Rocke was among those who openly recognized that Westlake’s method forced the classical framework of difference testing to control the β level, whereas that framework was devised to control α instead (16). Hence, in 1983, he proposed shifting the burden of proof from difference to non-difference, turning the perspective on the experimental hypothesis upside down. In the classical framework, equivalence was proven through non-difference as follows:
null hypothesis (H0) → non-difference: |δ| = 0 or |δ| ≤ |Δ|
alternative hypothesis (H1) → difference: |δ| > |Δ|.
Thus, he proposed the procedure below:
null hypothesis (H0) → non-equivalence: |δ| ≥ |Δ|
alternative hypothesis (H1) → equivalence: |δ| < |Δ|.
Therefore, what was the Type II error in difference testing became the Type I error in equivalence testing, making it easy to control the probability of erroneously declaring equivalence when a difference actually exists (now corresponding to α) and thus overcoming the limitations of Westlake’s method (Table 1).
Around the same years, Walter Hauck and Sharon Anderson advanced their own procedure, which relied on the concept of an “interval hypothesis”, placing equivalence, as the experimental hypothesis, under the alternative (15). Let mS and mE represent the average responses to a standard and an experimental formulation of a drug, respectively, and let A and B (with B > A) be the lower and upper boundaries of the corresponding equivalence interval. Then, per the Anderson and Hauck procedure, the hypothesis test can be stated as follows:
null hypothesis (H0) → non-equivalence: mE - mS ≤ A or mE - mS ≥ B
alternative hypothesis (H1) → equivalence: A < mE - mS < B.
In fact, two sets of observations are said to be non-equivalent if their average difference mE − mS = δ encroaches on the equivalence limits, that is, if δ ≤ A or δ ≥ B. Now, for a parallel design (two independent groups without a crossover effect), it is possible to build a statistical test using the frame of a Student’s two-sided t-test (Figure 1A):
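T = [(mE − mS) − (A + B)/2] / √[S(1/NE + 1/NS)]   (Eq. 1.1)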
in which the numerator represents the distance of the average difference from the center of the equivalence interval, NE and NS are the sizes of the experimental and standard groups, respectively, and S is the pooled sample variance, which in this case can be estimated as follows:
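S = [(NE − 1)SE² + (NS − 1)SS²] / (NE + NS − 2)   (Eq. 1.2)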
with SE and SS being the standard deviations of the experimental and standard groups, respectively. Significance for T can then be found using a Student’s t distribution with ν = NE + NS − 2 degrees of freedom. Notably, this method allows the study to be properly sized by applying the rules of power analysis in the appropriate way. In fact, in difference testing, power concerns the risk of missing an actual difference when there is one, while in equivalence testing it concerns the risk of failing to declare equivalence when there is actually no relevant difference. Anderson and Hauck also proved their procedure to be the most powerful test for equivalence compared with any confidence interval approach previously advanced (15). In other words, they showed that, given a certain interval of acceptability for equivalence, their method produced the lowest rate of false negatives, in the sense of erroneously rejected equivalent items (in their case, drug formulations).
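To make these power considerations concrete, the following minimal Python sketch (ours, not taken from the cited works) estimates by simulation the probability of declaring equivalence for a given design. It assumes normally distributed data with hypothetical values for the group size, standard deviation and equivalence limits, and uses the interval-inclusion rule (a 90% confidence interval lying entirely within the equivalence limits, which anticipates the procedure described in the next section) as the decision criterion, rather than the Anderson and Hauck statistic itself.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def equivalence_power(n_per_group, true_diff, sigma, lower, upper,
                      alpha=0.05, n_sim=10000):
    # Monte Carlo estimate of the probability of declaring equivalence
    # when the true mean difference is true_diff (normal data assumed).
    df = 2 * n_per_group - 2
    t_crit = stats.t.ppf(1 - alpha, df)
    declared = 0
    for _ in range(n_sim):
        x_e = rng.normal(true_diff, sigma, n_per_group)  # experimental group
        x_s = rng.normal(0.0, sigma, n_per_group)        # standard group
        diff = x_e.mean() - x_s.mean()
        s2 = (x_e.var(ddof=1) + x_s.var(ddof=1)) / 2     # pooled variance (equal n)
        se = np.sqrt(s2 * 2.0 / n_per_group)
        # declare equivalence if the 90% CI lies entirely within (lower, upper)
        if lower < diff - t_crit * se and diff + t_crit * se < upper:
            declared += 1
    return declared / n_sim

# hypothetical example: 50 subjects per group, sigma = 1, limits of +/- 0.5
# print(equivalence_power(50, 0.0, 1.0, -0.5, 0.5))

Increasing the group size or widening the limits raises the estimated power, which is precisely the trade-off that must be settled when sizing an equivalence study.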
The two one-sided tests (TOST)
In 1987, Donald Schuirmann discussed the power of an alternative procedure for testing equivalence, based on an adaptation of Westlake’s method to the null hypothesis of non-equivalence formulated in 1981 (17, 18). Let us consider the equivalence interval hypothesis as presented so far:
null hypothesis (H0) → non-equivalence: mE - mS ≤ A or mE - mS ≥ B
alternative hypothesis (H1) → equivalence: A < mE - mS < B.
Then, it can be rewritten by “decomposing” the interval into two single hypotheses:
null hypothesis (H01) → inferiority: mE - mS ≤ A
alternative hypothesis (H11) → non-inferiority: mE - mS > A,
null hypothesis (H02) → superiority: mE - mS ≥ B
alternative hypothesis (H12) → non-superiority: mE - mS < B.
Thereby, it is possible to test both null hypotheses H01 and H02 by applying to each of them a one-sided test at the nominal significance level α. Thus, to prove equivalence, both the hypothesis of inferiority and that of superiority must be disproved simultaneously. Formally, the two one-sided tests (TOST) can be written as shown below for a two parallel groups design (Figure 1B; see Appendix A for an example):
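T1 = [(mE − mS) − A] / √[S(1/NE + 1/NS)]
T2 = [B − (mE − mS)] / √[S(1/NE + 1/NS)]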
where NE and NS are the group sizes and N = NE + NS is the total sample size. For each test, significance is assessed on a Student’s t distribution with ν = N − 2 degrees of freedom, the corresponding null hypothesis being rejected when the statistic exceeds the critical value t(1 − α, ν), and S can be estimated per Eq. 1.2. It must be noticed that the procedure as a whole corresponds to a confidence level of 1 − 2α, in that each directional one-sided test (of inferiority and of superiority) has an individual significance level of α (see Appendix B for an example). An adequate sample size is necessary to provide each one-sided test with sufficient sensitivity. Schuirmann’s method, as a refinement of Westlake’s procedure, can also be viewed from the point of view of the confidence interval (CI). Setting α = 0.05, we can build a (1 − 2α) = 0.90 CI around δ by choosing the appropriate value of t:
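(mE − mS) ± t(1 − α, ν) × √[S(1/NE + 1/NS)]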
If an interval –Δ to +Δ is set, the 90% CI approach then offers the possibility of showing how well the alternative formulation fits the bioequivalence requirement (see Figure 2 and Appendix B for an example). Nowadays, the TOST is considered the standard test for bioequivalence assessment (19).
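As a practical illustration, the following minimal Python sketch (ours; the variable names and limits are hypothetical, and the code simply transcribes the formulas above using NumPy and SciPy rather than any dedicated package) carries out the TOST for a two parallel groups design and also returns the 90% CI:

import numpy as np
from scipy import stats

def tost_parallel(x_exp, x_std, lower, upper, alpha=0.05):
    # Two one-sided tests (TOST) for two independent groups;
    # lower and upper are the equivalence limits A and B for the mean difference.
    x_exp = np.asarray(x_exp, dtype=float)
    x_std = np.asarray(x_std, dtype=float)
    n_e, n_s = x_exp.size, x_std.size
    diff = x_exp.mean() - x_std.mean()
    # pooled sample variance (Eq. 1.2)
    s2 = ((n_e - 1) * x_exp.var(ddof=1) +
          (n_s - 1) * x_std.var(ddof=1)) / (n_e + n_s - 2)
    se = np.sqrt(s2 * (1.0 / n_e + 1.0 / n_s))
    df = n_e + n_s - 2
    # one-sided test of the inferiority hypothesis H01: diff <= A
    p_lower = stats.t.sf((diff - lower) / se, df)
    # one-sided test of the superiority hypothesis H02: diff >= B
    p_upper = stats.t.cdf((diff - upper) / se, df)
    # (1 - 2*alpha) = 90% confidence interval around the observed difference
    t_crit = stats.t.ppf(1 - alpha, df)
    ci = (diff - t_crit * se, diff + t_crit * se)
    equivalent = max(p_lower, p_upper) < alpha
    return diff, ci, p_lower, p_upper, equivalent

Declaring equivalence when both one-sided P values fall below α gives the same decision as checking that the 90% CI lies entirely within –Δ; +Δ.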
Equivalence or agreement for laboratory medicine?
So far, we have seen how equivalence testing stemmed from classical hypothesis testing and became the reference approach for bioequivalence problems. In 1995, Hartmann and co-authors cleverly addressed the limitations of difference testing in statistical procedures for method validation, invoking the adoption of the principles of bioequivalence assessment (20). Later, in 2001, Kondratovich and co-authors recognized the suitability of equivalence testing for comparative studies in laboratory medicine (21). Notably, they also showed that, for whatever regression model is used to measure the agreement between paired observations, it is possible to reformulate its testing framework using a composite hypothesis of equivalence:
null hypothesis (H0) → non-equivalence: slope ≤ 1 - δ or slope ≥ 1 + δ
alternative hypothesis (H1) → equivalence: 1 - δ < slope < 1 + δ.
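By way of illustration, the following minimal Python sketch (ours; it uses ordinary least squares as a stand-in for whichever regression model is adopted, with simulated data and a hypothetical margin δ) applies the two one-sided tests to the estimated slope:

import numpy as np
from scipy import stats

def slope_equivalence(x, y, delta, alpha=0.05):
    # TOST on the slope of an ordinary least squares fit of y on x;
    # equivalence limits for the slope are 1 - delta and 1 + delta.
    fit = stats.linregress(x, y)
    df = len(x) - 2
    p_lower = stats.t.sf((fit.slope - (1 - delta)) / fit.stderr, df)   # H0: slope <= 1 - delta
    p_upper = stats.t.cdf((fit.slope - (1 + delta)) / fit.stderr, df)  # H0: slope >= 1 + delta
    return fit.slope, max(p_lower, p_upper) < alpha

# simulated paired measurements from two hypothetical methods
rng = np.random.default_rng(7)
x = np.linspace(1.0, 10.0, 40)
y = 1.02 * x + rng.normal(0.0, 0.2, x.size)
slope, equivalent = slope_equivalence(x, y, delta=0.1)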
In a broader discussion provided by Mascha and Sessler, it was shown that testing hypotheses of equivalence can easily be achieved through regression analysis (see Appendix C) (22). Thus, despite such evidence, we might wonder why we still ignore equivalence testing in laboratory medicine, given that we are so often concerned with comparing devices and procedures. Do we really need it?
To answer, we should take two main aspects into consideration. The first is cultural, and concerns the way we have been raised in our probabilistic approach to scientific research. Of course, in laboratory medicine we have inherited the idea of experimental science with the burden of proof resting on difference. Hence, although we might set up comparative studies to answer whether a new device or procedure could replace an old one, we still approach them from a classical perspective, ignoring equivalence. Thus, we might imagine easily filling the gap just by popularizing equivalence among biomedical researchers, expecting to see all new studies approached in this alternative way within the next ten years, as happened in pharmacology.
However, and this is the second aspect, equivalence may be difficult to handle in laboratory medicine. Equivalence cannot stand by itself, but demands the external support provided by the so-called equivalence interval to gain a meaning. Indeed, it is an aprioristic and conservative approach that requires setting an interval of allowance before proving the actual existence of any bias. On the contrary, agreement is a more liberal approach, in that it considers any difference just an erratic and uninfluential factor unless proven otherwise. Thereby, it is concerned with bias only afterwards, leaving room for more pragmatic considerations. Thus, apart from our statistical heritage, we would have to state how much difference counts as equivalent when comparing devices and procedures, and this could generate some confusion. An interesting example of this scenario was offered by Lung and co-authors, who discussed the suitability of equivalence testing for assessing automated test procedures (23). In their manuscript, when presenting the analysis of data using TOST, the authors advanced two distinct equivalence intervals, ± 2% for the assay result and ± 3% for content uniformity. Translated into real laboratory medicine, we would have to invoke different equivalence intervals for different applications, like therapeutic drug monitoring and hormone testing, and for different domains, like the comparison of alternative analytical methods and the comparison of standard and non-standard pre-analytical procedures. Therefore, any study on equivalence would have its own truth about equivalent methods depending on the criterion adopted, with considerable practical consequences. Let us imagine we set an equivalence interval of ± 5% for analytical results in therapeutic drug monitoring, based on some considerations regarding the allowable uncertainty at the medical decision limit of the therapeutic window. How many methods would prove equivalent within such a narrow range? Almost none if we consider immuno-enzymatic methods, very few if we consider high performance liquid chromatography (HPLC) methods, and some more in the case of liquid chromatography with tandem mass spectrometry (LC-MS/MS) methods. Thus, we would have to imagine a scenario in which immuno-enzymatic methods were banned from laboratories, but how could we then perform urgent testing on the night shift when the LC-MS/MS facility is not operating? In this regard, Feng and co-authors adopted a criterion based on the 15% uncertainty usually accepted for analytical method validation above the lower limit of quantitation, in order to relax the acceptance criterion (24). However, as stated above, which criterion is more appropriate for a given scenario should be assessed beforehand. Hence, authoritative organizations in the field of laboratory medicine, like the International Federation of Clinical Chemistry, should discuss the topic before equivalence can be thought to replace agreement in comparative studies. Otherwise, researchers should be concerned with showing the suitability of a certain criterion of equivalence before assessing the equivalence itself.