Null hypothesis significance testing (NHST), or statistical significance testing, is used to compare sample means with hypothesized population means (Warner, 2021). Once a proposed value for the population mean is determined, it is compared to a sample mean to see whether the proposed population value is reasonable (Warner, 2021). A one-sample t test demonstrates this process of making inferences about a population by evaluating sample data (Warner, 2021). Depending on the information desired, options for the one-sample t test include a two-tailed (nondirectional) test and a one-tailed (directional) test (Warner, 2021). A directional significance test is used when the population mean is expected to differ from the null-hypothesized value in one specified direction, either higher or lower (Warner, 2021). Although dependence on NHST in science is high, critiques have been leveled for many years at the analyses and conclusions of a large number of studies that rely on it (Anderson, 2020; Murphy, 2019; Wasserstein et al., 2019). Given its prevalence, NHST must be understood and improved upon, remembering that “null hypothesis significance testing is essentially an assessment of whether or not your study was powerful enough to detect whatever effect you are studying, and it is very little else” (Murphy, 2019, p. 2).
Significance Tests
The usefulness of a directional significance test is illustrated by an example based on the speed limit data set provided by Warner (2021). Researchers who want to evaluate the null hypothesis that students drive at the speed limit can sample a portion of student drivers and use a directional significance test if they seek to disconfirm the null in favor of the one-directional alternative hypothesis that student drivers drive faster than the speed limit. Once the data are gathered, researchers determine how far the sample mean lies from the hypothesized population mean in standard error units, a quantity known as the t ratio (Warner, 2021). The obtained t value and the degrees of freedom together determine the tail area of the distribution beyond that t value, known as the p value (Warner, 2021). If the obtained t value falls in the tail area designated for rejection, analysts may consider the sample mean an unusual outcome if the null hypothesis were true, which provides the impetus to reject the null hypothesis (Warner, 2021). Which t values lie within this rejection region is determined by the selected alpha (α) level (Warner, 2021).
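The t ratio described above can be sketched in a few lines of Python. This is only an illustration: the nine speeds below are hypothetical values chosen to have a mean of 39 mph and are not Warner's (2021) data, so the resulting t will differ from the value reported in the text. The function name `one_sample_t` is likewise an assumption made for this example.

```python
import math
import statistics


def one_sample_t(sample, mu0):
    """Return (t ratio, degrees of freedom) for a one-sample t test."""
    n = len(sample)
    mean = statistics.mean(sample)
    sd = statistics.stdev(sample)      # sample SD (divides by n - 1)
    se = sd / math.sqrt(n)             # standard error of the mean
    return (mean - mu0) / se, n - 1


# Hypothetical speeds in mph (illustrative only, not the textbook data set)
speeds = [30, 35, 38, 40, 41, 42, 44, 36, 45]
t, df = one_sample_t(speeds, mu0=35)
print(f"mean = {statistics.mean(speeds)}, t = {t:.3f}, df = {df}")
```

A positive t indicates a sample mean above the hypothesized 35 mph, which is the direction the one-tailed alternative hypothesis predicts.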
In the speed limit data set example, with a test value of 35 mph, 8 degrees of freedom, a one-tailed test, an α level of 0.05, and a sample mean of 39 mph, a t value of +1.966 is obtained (Warner, 2021). The corresponding p value is 0.04225, which, being lower than α, leads the researcher to reject the null hypothesis and consider the result statistically significant. However, it must be noted that driving 4 mph over the speed limit is not significant in a practical, real-life sense, and that had this analysis been performed as a two-tailed test, the resulting p value would not have been lower than α (Warner, 2021).
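The relationship between the obtained t value and its one-tailed and two-tailed p values can be checked numerically. The sketch below uses only the Python standard library and approximates the tail area of the t distribution with Simpson's rule; the helper names `t_pdf` and `upper_tail_p` are assumptions for this example, and a statistics package would normally be used instead. The one-tailed result should land close to the value reported above, with small differences attributable to rounding of t.

```python
import math


def t_pdf(x, df):
    """Probability density of Student's t distribution at x."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def upper_tail_p(t, df, steps=2000):
    """Approximate P(T >= t) for t >= 0 using Simpson's rule on [0, t]."""
    h = t / steps                      # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area_0_to_t = total * h / 3
    return 0.5 - area_0_to_t           # half the distribution lies above 0


t_obtained, df, alpha = 1.966, 8, 0.05   # values reported in the text
p_one = upper_tail_p(t_obtained, df)     # directional (one-tailed) p
p_two = 2 * p_one                        # nondirectional (two-tailed) p

print(f"one-tailed p = {p_one:.4f}, reject H0: {p_one < alpha}")
print(f"two-tailed p = {p_two:.4f}, reject H0: {p_two < alpha}")
```

Because the two-tailed p value is twice the one-tailed value, the same t leads to rejection under the directional test but not under the nondirectional test.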
The Value of α and p
The value of α is often set to 0.05 or 0.01. The value 0.05 is used so often that, unless stated otherwise, it is the value assumed in a study (Warner, 2021). Therefore, for a two-tailed test it is conventionally acceptable to reject the null hypothesis if the obtained t value falls within the most extreme 2.5% of either the upper or the lower tail of the t distribution (Warner, 2021). This α level is considered to carry a low risk of making a Type I decision error (Warner, 2021). It is interesting to note that, for the same α level, the critical value of t is lower for a one-tailed test than for a two-tailed test. In other words, the sample mean does not have to be as far away from the hypothesized mean in order to reject the null hypothesis with a one-tailed test (Warner, 2021). Because reporting a p value is in effect reporting the risk of a Type I error, it is desirable for p to be small, preferably smaller than 0.05. However, p values must not be misinterpreted as the probability that the null hypothesis is true (Anderson, 2020); they give no information about effect size and no information about the success or failure of a study (Warner, 2021).
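The difference between one-tailed and two-tailed critical values can be illustrated with a short standard-library sketch. It assumes 8 degrees of freedom (as in the speed limit example) and finds each critical value by bisection on a numerically integrated tail area; the helper names `t_pdf`, `upper_tail_p`, and `critical_t` are assumptions for this example rather than functions from any particular library.

```python
import math


def t_pdf(x, df):
    """Probability density of Student's t distribution at x."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def upper_tail_p(t, df, steps=1000):
    """Approximate P(T >= t) for t >= 0 using Simpson's rule on [0, t]."""
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3


def critical_t(tail_area, df):
    """Find t such that P(T >= t) equals tail_area, by bisection."""
    lo, hi = 0.0, 20.0
    for _ in range(50):
        mid = (lo + hi) / 2
        if upper_tail_p(mid, df) > tail_area:
            lo = mid                   # tail still too large: move right
        else:
            hi = mid
    return (lo + hi) / 2


alpha, df = 0.05, 8
crit_one = critical_t(alpha, df)       # all of α in one tail
crit_two = critical_t(alpha / 2, df)   # α split evenly across both tails
print(f"one-tailed critical t = {crit_one:.3f}")
print(f"two-tailed critical t = {crit_two:.3f}")
```

Splitting α across two tails leaves only 2.5% in each, which pushes the cutoff farther from zero than the one-tailed cutoff.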
No discussion of statistical significance can ignore the literature on the use of p values to determine it. Wasserstein et al. (2019) summarize many of the alternatives to using p values in this manner. Although statistical significance was originally intended to indicate the need for further study, it has become a marker of scientific importance, has led to erroneous beliefs and poor decision making (such as believing that a sufficiently low p value speaks to the size of an effect), and has encouraged biased reporting as well as selective publishing (Wasserstein et al., 2019). In essence, p values should be treated as one statistical tool among others rather than as the deciding factor in a dichotomous judgment of significant versus nonsignificant (Wasserstein et al., 2019).
Decision Errors
NHST is used to decide whether to reject or not reject a null hypothesis (Warner, 2021). Whichever decision the researcher makes, in the actual state of the world the null hypothesis is either true or false. This leads to two types of decision errors that can be committed.
Type I Error
If the null hypothesis is actually true and the researcher decides to reject it, this is a Type I error (Warner, 2021). For example, a researcher tests the null hypothesis that puppy pads are not used consistently by dogs, collects data that yield a t value in the extreme 5% of the t distribution rather than in the middle 95%, and therefore rejects the null hypothesis and concludes that puppy pads are used consistently. However, puppy pads truly are not used consistently by dogs, and owners are often left with messes on their floors to pick up. The researcher took a 5% risk of rejecting a null that was true, and in this instance the error occurred. The magnitude of risk for this type of error depends on the α level selected and on adherence to the assumptions of NHST (Warner, 2021). A researcher must adhere closely to these assumptions and choose a low α level to minimize this risk.
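The link between the α level and the risk of Type I error can be illustrated with a small simulation. The sketch below is a hedged example rather than an analysis from Warner (2021): it borrows the numeric speed limit setup because the puppy pad example has no stated measurements, and it assumes normally distributed speeds, a hypothetical population standard deviation of 6 mph, samples of 9 drivers, and the standard table value of 2.306 for a two-tailed test at α = 0.05 with 8 degrees of freedom.

```python
import math
import random
import statistics

random.seed(1)                 # fixed seed so the run is repeatable

MU0 = 35.0                     # null-hypothesized mean (mph); H0 is TRUE here
SD = 6.0                       # hypothetical population SD (assumption)
N = 9                          # sample size, giving df = 8
T_CRIT = 2.306                 # two-tailed critical t, α = .05, df = 8 (table value)


def rejects_null(sample):
    """Return True if a two-tailed one-sample t test rejects H0: mu = MU0."""
    se = statistics.stdev(sample) / math.sqrt(len(sample))
    t = (statistics.mean(sample) - MU0) / se
    return abs(t) > T_CRIT


trials = 10_000
false_rejections = sum(
    rejects_null([random.gauss(MU0, SD) for _ in range(N)])
    for _ in range(trials)
)
type1_rate = false_rejections / trials
print(f"Proportion of true nulls rejected: {type1_rate:.3f}")
```

Since every simulated sample comes from a population in which the null hypothesis is true, each rejection is a Type I error, and the long-run proportion of rejections should sit near the chosen α.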
Type II Error
At times the number of cases in a study is too small to provide adequate statistical power (Warner, 2021). Following the puppy pads example, a Type II error is committed when the null is actually false (the puppy pads are used consistently by dogs), but the researcher fails to reject the null (concludes that puppy pads are not used consistently by dogs). The risk of this type of error can be reduced by increasing the α level, a practice not commonly performed (Warner, 2021). Also beneficial for minimizing Type II error are increasing sample size and designing studies to maximize effect size (Warner, 2021). Therefore, if a real effect is incorrectly reported as not significant, so that a false null hypothesis is retained, this is a Type II error. Implications of this error include denying acceptance of alternative treatments and hindering further research. Conclusions drawn from nonsignificant results should therefore be framed by the knowledge that Type II error may occur in underpowered studies (Warner, 2021).
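The effect of sample size on Type II error can be sketched with a simulation in the same spirit. This is an illustrative example under stated assumptions, not a result from Warner (2021): the true population mean is set to 39 mph so that the null hypothesis of 35 mph is false, speeds are assumed to be normally distributed with a hypothetical standard deviation of 6 mph, and the one-tailed critical values at α = 0.05 (1.860 for 8 degrees of freedom and 1.711 for 24) are standard table values.

```python
import math
import random
import statistics

random.seed(2)                 # fixed seed so the run is repeatable

MU0 = 35.0                     # null-hypothesized mean (mph)
TRUE_MEAN = 39.0               # actual population mean, so H0 is FALSE
SD = 6.0                       # hypothetical population SD (assumption)

# One-tailed critical t at α = .05, keyed by sample size (table values)
T_CRIT = {9: 1.860, 25: 1.711}


def type2_rate(n, trials=5_000):
    """Proportion of samples that fail to reject a false H0 (one-tailed)."""
    misses = 0
    for _ in range(trials):
        sample = [random.gauss(TRUE_MEAN, SD) for _ in range(n)]
        se = statistics.stdev(sample) / math.sqrt(n)
        t = (statistics.mean(sample) - MU0) / se
        if t <= T_CRIT[n]:     # not in the upper rejection region
            misses += 1
    return misses / trials


beta_9 = type2_rate(9)
beta_25 = type2_rate(25)
print(f"Type II error rate, N = 9:  {beta_9:.3f}")
print(f"Type II error rate, N = 25: {beta_25:.3f}")
```

Under these assumptions the larger sample misses the real 4 mph difference far less often, which is the sense in which increasing sample size raises statistical power.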
References
Anderson, S. F. (2020). Misinterpreting p: The discrepancy between p values and the probability the null hypothesis is true, the influence of multiple testing, and implications for the replication crisis. Psychological Methods, 25(5), 596-609. https://doi.org/10.1037/met0000248
Murphy, K. (2019). Reducing our dependence on null hypothesis testing: A key to enhance the reproducibility and credibility of our science. SA Journal of Industrial Psychology, 45. https://doi.org/10.4102/sajip.v45i0.1717
The Scriptures. (2018). Institute for Scripture Research.
Warner, R. (2021). Applied statistics I: Basic bivariate techniques (3rd ed.). SAGE Publications, Inc.
Wasserstein, R. L., Schirm, A. L., & Lazar, N. A. (2019). Moving to a world beyond p < 0.05. The American Statistician, 73(1), 1-19. https://doi.org/10.1080/00031305.2019.1583913
