I’m a (now retired) mathematician who became interested in statistics in the late 1990s. Since my university was seriously undersupplied with statisticians at that time, I quickly found most of my professional activity turning to statistics. Between seeking how best to learn and teach statistics, teaching it (both graduate and undergraduate), being asked to serve on students’ Ph.D. committees, and being asked questions about statistics by colleagues in other departments, I became aware that there was a lot of misuse of statistics in the literature in many fields. I believe that most of it is “innocent,” in the sense that someone uses a technique without really understanding its limitations, then that sets a precedent which is followed by others, until poor practices become entrenched, in textbooks as well as in practice.
 
Based on that experience and on Ioannidis’ 2005 PLoS Medicine article “Why Most Published Research Findings Are False,” when I retired in 2010, I decided to spend some of my time doing what I could to create awareness of this problem and of ways to improve the situation. This has included trying to follow the literature stimulated by Ioannidis’ work, attending journal clubs, offering a continuing education course (drawing participants from state agencies as well as academia), and maintaining a website. Among other things, I have been trying to follow the recent literature on the subject that has appeared in psychology journals, and more recently in psychology blogs.
 
A few weeks ago, I came across David Funder’s blog, and followed a link there to download the article “Improving the dependability of research …” that he coauthored and that appeared in Personality and Social Psychology Review in February. Since there was much that I thought was good about the article, but also several important points that I thought were missing, I wrote comments and sent them to him, with permission to share with his coauthors or on his blog. He replied suggesting that posting my comments on the SPSP blog would be appropriate, since SPSP sponsored the original article. The blog editor agreed, so here they are:
 
The Funder et al. paper includes many relevant and important recommendations for improving the dependability of research in a variety of fields in addition to those mentioned in the title. It also does a good job of explaining why those recommendations are important. I especially appreciate the emphasis on “getting it right,” and the comment (p. 7) that “…we need to promote a climate that emphasizes ‘telling the whole story’ rather than ‘telling a good story’.” So, in the spirit of encouraging “getting it right” and telling more of the whole story, I offer some suggestions that I believe would strengthen the recommendations in Funder et al.
 
I. More is needed on multiple testing. The statement (p. 6) that “conducting multiple tests of significance on a data set without statistical correction” is “widely regarded as questionable” is an important part of the “story” on multiple testing.1 But there’s much more than this to the story. In particular, there are two other places in the paper where multiple inference needs to be discussed.
 
For purposes of illustration in what follows, suppose five hypothesis tests are performed. If an overall significance level of .05 is desired, there are two ways of using the simple Bonferroni correction for multiple testing. One is to multiply each of the p-values for the five hypothesis tests by 5 to obtain “adjusted p-values,” which are then compared to the desired overall significance level of .05. For example, if the five (unadjusted) p-values (for tests 1, 2, 3, 4, 5 respectively) were .001, .04, .03, .08, .015, then the respective adjusted p-values would be .005, .2, .15, .4, .075, and only the first hypothesis test would be deemed significant.
 
The second way is to apply a significance level of .05/5 = .01 to each of the five hypothesis tests. The example illustrates that this would achieve the same final result: only the first test would be deemed significant at the adjusted significance level of .01.
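The following is a minimal sketch (mine, not from the paper) showing that the two forms of the simple Bonferroni correction give the same decisions for the five example p-values above:

```python
# A minimal sketch of the two equivalent forms of the simple Bonferroni
# correction, using the five example p-values from the text above.

p_values = [0.001, 0.04, 0.03, 0.08, 0.015]
overall_alpha = 0.05
m = len(p_values)

# Form 1: multiply each p-value by the number of tests (capping at 1)
# and compare the adjusted p-values to the overall significance level.
adjusted_p = [min(p * m, 1.0) for p in p_values]
significant_form1 = [p_adj <= overall_alpha for p_adj in adjusted_p]

# Form 2: compare each unadjusted p-value to the per-test level alpha/m.
per_test_alpha = overall_alpha / m                   # .05/5 = .01
significant_form2 = [p <= per_test_alpha for p in p_values]

print([round(p_adj, 4) for p_adj in adjusted_p])     # [0.005, 0.2, 0.15, 0.4, 0.075]
print(significant_form1)                             # [True, False, False, False, False]
print(significant_form2)                             # same decisions: only test 1 is significant
```

In practice, a library routine such as multipletests in the Python statsmodels package can do this bookkeeping, and also offers less conservative procedures (e.g., Holm’s step-down method).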
 
The latter approach (using an adjusted Type I error rate for each test, rather than the overall Type I error rate of .05) is helpful in explaining why multiple inference needs to be taken into account in two more items discussed in the paper.
 
a. Power and sample size. The discussion in the paper (pp. 3 – 5) does mention the interplay of Type I error rate with power and sample size, and does say (p. 4), “The sample size should normally be justified based on the smallest effect of interest.” What is not mentioned, but is important in practice, is that, if multiple tests are involved, then the Type I error rate used to calculate sample size for each test needs to be the adjusted Type I error rate used when accounting for multiple testing. Using the adjusted Type I error rate will usually result in a larger estimate of sample size, so neglecting to use it is likely to result in underpowered studies when more than one inference is performed.
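As a concrete illustration, here is a sketch using the standard normal-approximation formula for the per-group sample size of a two-sided, two-sample comparison of means; the raw effect, standard deviation, and desired power are made-up numbers of my own, not values from the paper:

```python
# Sketch: the effect of a Bonferroni-adjusted Type I error rate on a
# sample-size calculation, using the normal-approximation formula for a
# two-sided, two-sample comparison of means. All numbers are hypothetical.
import math
from scipy.stats import norm

def n_per_group(raw_effect, sd, alpha, power):
    """Approximate sample size per group for detecting a raw mean difference."""
    z_alpha = norm.ppf(1 - alpha / 2)   # two-sided critical value
    z_power = norm.ppf(power)           # quantile corresponding to desired power
    return math.ceil(2 * ((z_alpha + z_power) * sd / raw_effect) ** 2)

raw_effect, sd, power = 5.0, 10.0, 0.80                          # hypothetical values

print(n_per_group(raw_effect, sd, alpha=0.05, power=power))      # single test: about 63 per group
print(n_per_group(raw_effect, sd, alpha=0.05 / 5, power=power))  # one of five tests: about 94 per group
```

Under these made-up numbers, the required sample size per group rises from about 63 to about 94 once the test is treated as one of five, which is why calculating sample size with the unadjusted .05 tends to leave multiple-inference studies underpowered.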
 
b. Confidence intervals. Recommendation 2 (p. 5) says, “Report … 95% CIs for reported findings.” This fails to take into account multiple inference. Analogous to considering a simple Bonferroni correction as dividing the overall significance level (e.g., .05) by the number of hypothesis tests performed, a simple Bonferroni approach to multiple confidence intervals is to divide 1 – (overall confidence level) by the number of CI’s calculated. For example, to achieve an overall confidence level of 95% (i.e., an overall error rate of .05) when calculating CI’s for five different effects, the simple Bonferroni approach would require 99% CI’s for each effect. Presenting 95% CI’s would be misleading: they would be narrower than the 99% CI’s, thus giving the impression of greater precision than is warranted when estimating more than one parameter.
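To make the width difference visible, here is a sketch with simulated data (the sample and its distribution are my own invention); it compares a per-interval 95% t-interval with the 99% interval that the simple Bonferroni approach calls for when five intervals are reported together:

```python
# Sketch: a naive 95% t-interval vs. the 99% interval implied by a simple
# Bonferroni adjustment for five simultaneous intervals. Data are simulated.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(loc=10.0, scale=3.0, size=30)   # hypothetical sample for one of five effects

mean = x.mean()
sem = stats.sem(x)      # estimated standard error of the mean
df = len(x) - 1

def t_interval(conf_level):
    """Two-sided t confidence interval for the mean at the given level."""
    half_width = stats.t.ppf(1 - (1 - conf_level) / 2, df) * sem
    return (mean - half_width, mean + half_width)

print(t_interval(0.95))          # per-interval 95% CI
print(t_interval(1 - 0.05 / 5))  # 99% CI: wider, reflecting the simultaneous inference
```

The Bonferroni-adjusted interval is wider; reporting it makes the extra uncertainty that comes from estimating five parameters at once visible rather than hidden.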
 
II. More attention to the role of model assumptions is needed. The sentence (p. 7), “In addition, it is critical that the statistical analyses used are appropriate for the questions and the nature of the data collected,” is perhaps a vague way of talking about model assumptions, but much more detail is needed.
 
Examples:
 
a. The discussion of p-values correctly states (p. 2), “The p-value is the conditional probability that …”, but then goes on to state, “… that condition is that the relationship … in the population is precisely 0.” In fact, the condition in the conditional probability is a composite condition: One part of the condition is indeed that the relationship in the population is zero, but the other part is that the model assumptions of the hypothesis test are satisfied. One can’t usually be sure that model assumptions are satisfied, but good practice in doing and reporting research requires some discussion of whether or not they are plausible in the context, what the most likely departures are, and what robustness considerations might apply.
 
b. The discussion of power presents a similar situation: As stated (p. 3), power is indeed a conditional probability, but again the condition is a composite one: First, that “a true effect of the precisely specified size will not be detected under NHST” (as stated), but also that the model assumptions for the hypothesis test are satisfied. So again, discussion of how well the model assumptions are satisfied, and of how any departures might affect power and sample size considerations, is needed to “tell the whole story”.
 
c. The validity of confidence interval calculations also depends on model assumptions – so here once more, telling the whole story requires the type of discussion mentioned in (a) and (b).
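To make the consequences concrete, here is a small simulation sketch (my own illustration, not from Funder et al.): when one assumption of the pooled two-sample t-test fails (equal variances, with unequal group sizes), the actual false-positive rate can be far from the nominal .05, so a reported “p < .05” no longer means what it appears to mean.

```python
# Sketch: simulated Type I error rate of the pooled two-sample t-test when
# the equal-variance assumption fails and group sizes differ. Both groups
# have the same mean, so every rejection at p < .05 is a false positive.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_sims = 10_000
reject_pooled = 0
reject_welch = 0

for _ in range(n_sims):
    small_group = rng.normal(0.0, 3.0, size=10)   # smaller group, larger SD
    large_group = rng.normal(0.0, 1.0, size=40)   # larger group, smaller SD
    reject_pooled += stats.ttest_ind(small_group, large_group, equal_var=True).pvalue < 0.05
    reject_welch += stats.ttest_ind(small_group, large_group, equal_var=False).pvalue < 0.05

print(reject_pooled / n_sims)   # far above the nominal .05 (roughly .2 in this setup)
print(reject_welch / n_sims)    # near .05: Welch's test drops the equal-variance assumption
```

The point is not that Welch’s test is always the remedy, but that the meaning of a p-value (or of power, or of a CI) is conditional on assumptions like this one, and the report should say how plausible they are.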
 
Unfortunately, many statistics textbooks give short shrift to model assumptions2 – so this is a topic that needs to be included in the Recommendations for Educational Practice – perhaps as parts of Recommendations 1 and 3 (p. 7) – in order to tell the whole story.
 
III. It needs to be emphasized that, where statistics is involved, “getting it right” does not mean “getting the right answer”. Indeed, a big part of “getting it right” is accepting that uncertainty is inherent in any subject requiring inferential statistics. For example:
  • Engaging in multiple inference increases the uncertainty in our conclusions, hence warrants adjustments to take that additional uncertainty into account.
  • One important value of using confidence intervals is to emphasize the uncertainty in our results.
  • Lack of fit of model assumptions increases the uncertainty in our results, so we need to be open about any lack of fit, in order to be up front about the uncertainty.
IV. I definitely appreciate the statement (p. 3) that “In areas with clear consensus that the measurement units are at least interval level … unstandardized effect sizes are preferred.” However, I believe that the statement that in other cases, “standardized effect sizes are advisable,” might prompt neglect of raw effect sizes. I can see the argument for giving standardized effect sizes in these cases, but believe it is still important to give raw effect sizes. This is partly for the reasons listed at the end of the section on effect size, but also because the habit of considering raw effect sizes can promote thinking about what are good and not-so-good measures (and in particular, can help tell the “whole story” about the uncertainty arising from less-than-ideal measures), and can help in thinking about practical significance.
 
V. The section “Relations Among Type I Error, Standardized Effect Size, …” is a mixed bag. On the one hand, it makes the important points that “focusing solely on the observed p level is problematic because findings with equivalent p levels can have very different implications,” that “The problems with focusing exclusively on the observed p level are exacerbated when researchers overrely on the dichotomous distinction between ‘significant’ and ‘non-significant’ results,” that “the routine reporting of effect sizes and CIs would help prevent researchers from drawing misleading interpretations,” and that “… experiments with 10 or 20 participants per condition – which are not uncommon – are seriously underpowered.”
 
At the same time, the discussion relies on standardized effect sizes only and refers to Cohen’s “norms” of small, medium, and large effect sizes. I appreciate that Cohen introduced his simplified method of power/sample size analysis (based on standardized effect sizes and S, M, L sizes) in order to promote the practice of paying attention to power, and that indeed this “good story” has had that effect. But I believe that this area now needs more attention to “the whole story.” In particular, the “whole story” involves the interrelation between Type I error rate, raw effect size, variability3, and sample size, which is related to the comments in item III above. (Indeed, this section in some ways seems to veer away from some of the best aspects of the section on effect size.) In addition, thinking about variability and raw effect size individually (instead of conflated, as they are in standardized effect sizes) promotes thinking about possible better research designs as well as better measures.
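A tiny sketch with made-up numbers (mine, not the paper’s) shows the conflation: two quite different situations yield the same standardized effect size.

```python
# Sketch: the same standardized effect size (Cohen's d) can arise from very
# different combinations of raw effect and variability, which is why the two
# are worth reporting and thinking about separately. Numbers are hypothetical.

def cohens_d(raw_difference, sd):
    """Standardized effect size: raw mean difference divided by a common SD."""
    return raw_difference / sd

scenarios = [
    ("small raw effect, low variability", 1.0, 2.0),
    ("large raw effect, high variability", 5.0, 10.0),
]

for label, raw_difference, sd in scenarios:
    d = cohens_d(raw_difference, sd)
    print(f"{label}: raw difference = {raw_difference}, SD = {sd}, d = {d:.2f}")

# Both lines print d = 0.50 ("medium" by Cohen's norms), yet the questions they
# raise about measurement quality and research design are quite different.
```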
 
VI. Although meta-analysis can be used well, it is also (like all statistical techniques) subject to poor use. Berk and Freedman (2003) give a good discussion (pp. 9 – 17 of the preprint) of potential missteps in using meta-analysis. They propose as a better alternative (p. 17), “read the papers, think about them, and summarize them.” Thus, education in cautions about meta-analysis needs to be considered part of Recommendations 1 and 3 for Educational Practice (p. 7), in order to tell the whole story and “get it right.”
 
VII. Mentioning Bayesian alternatives to frequentist methods (p. 8) is very appropriate. But Bayesian methods require cautions, just as frequentist methods do. So giving a single reference is counter to the spirit of telling the whole story. References that might help round out the discussion in the single paper referenced include Gelman and commenters (2009); Raftery and discussants (1995); Andrews and Baguley (2013); Gelman and Shalizi, and discussants (2013)4.
 
Image credit: Proceed with Caution by Bart Maguire on Flickr

Notes:

 
1. I have found web demos and a cartoon (as outlined in Note 2 at http://www.ma.utexas.edu/blogs/mks/2014/06/28/beyond-the-buzz-part-iv-multiple-testing/) useful in helping students realize the problems inherent in multiple testing.
 
2. a. One introductory statistics textbook that does well in bringing attention to model assumptions is DeVeaux, Velleman, and Bock (2012).
 
b. Although it focuses on criminology, R.A. Berk and D.A. Freedman (2003) is also worthwhile reading to help start thinking seriously about model assumptions and how violations of them might affect the validity of interpretations of statistical calculations.
 
3. Best choice of word here is difficult; I have used “variability” to emphasize that not only does some measure of variability (e.g., standard deviation in simple cases) enter calculations of power or sample size, but that variability is a fact of life that needs to be respected in any field using statistics. “Variation” might be an equally good choice, but “standard deviation” is too narrowly focused, and “variance” would be worse, because its units are the square of the units of the quantity being measured.
 
4. Bayesian methods could be useful in other ways in addition to being an alternative method of testing hypotheses. For example, Gelman (2014) points out (pp. 26 – 27) that “Classical statistics tends to focus on estimation or testing for a single parameter or low-dimensional vector, whereas Bayesian methods work particularly well when the goal is inference about a large number of uncertain quantities.” The red-state, blue-state problem he discusses is in some sense analogous to how a psychological phenomenon (e.g., stereotype susceptibility) might vary across populations, cohorts, and conditions. Thus a Bayesian approach, incorporating data from multiple studies, might be more helpful than classical hypothesis testing for studying such a complex situation. But again, Bayesian methods, like classical methods, require cautions; see, for example, Hand (2014) and Welsh (2014).
 
References:
 
M. Andrews and T. Baguley (2013), Prior approval: The growth of Bayesian methods in psychology, British Journal of Mathematical and Statistical Psychology, 66, 1-7.
 
R.A. Berk and D.A. Freedman (2003), Statistical assumptions as empirical commitments, in T. G. Blomberg and S. Cohen (eds.), Law, Punishment, and Social Control: Essays in Honor of Sheldon Messinger, 2nd ed., Aldine de Gruyter, pp. 235–254; preprint available at http://www.stat.berkeley.edu/~census/berk2.pdf
 
DeVeaux, Velleman, and Bock (2012), Stats: Data and Models, Addison Wesley.
 
A. Gelman and commenters (2009), Why I don’t like so-called Bayesian hypothesis testing, Statistical Modeling, Causal Inference and Social Science (blog), 26 February 2009, http://andrewgelman.com/2009/02/26/why_i_dont_like/
 
A. Gelman and C. Shalizi, and discussants (2013), Philosophy and the practice of Bayesian statistics, British Journal of Mathematical and Statistical Psychology, 66, 8–80.
 
A. Gelman (2014), How Bayesian analysis cracked the red-state, blue-state problem, Statistical Science, Vol. 29, No. 1, 26–35.
 
D. Hand (2014), Wonderful examples, but let’s not close our eyes, Statistical Science, Vol. 29, No. 1, 98–100.
 
A. Raftery and discussants (1995), Sociological Methodology, Vol. 25, pp. 111–195. Includes Raftery: Bayesian Model Selection in Social Research (111–163); Gelman and Rubin: Avoiding Model Selection in Bayesian Social Research (165–173); Hauser: Better Rules for Better Decisions (175–183); and Raftery rejoinder: Model Selection is Unavoidable in Social Research (185–195).
 
A. H. Welsh (2014), Discussion, Statistical Science, Vol. 29, No. 1, 101–102.
 

Martha K. Smith is Professor Emerita at The University of Texas at Austin, where she has been on the mathematics faculty since 1973. You can contact her at [email protected], and find more about her professional activities in recent years at http://www.ma.utexas.edu/users/mks/.