Further Thoughts on Replications, Ceiling Effects and Bullying
By Simone Schnall
The following is a reposting of Simone Schnall’s blog post from Saturday, May 31st. We posted her previous post last week and a good deal of discussion followed, so we are hoping to provide people with the opportunity to continue that conversation here. Also, consistency. Here is a link to the email exchange between all the relevant parties posted by Brian Nosek.
I hope that we are able to extend the benefit of the doubt to our colleagues, whatever our differences of opinion may be, because I think it’s reasonable to assume that we all have the best intentions in trying to improve the way our science works, a process that, if all goes well, will continue indefinitely.
Recently I suggested that the registered replication data of Schnall, Benton & Harvey (2008), conducted by Johnson, Cheung & Donnellan (2014) had a problem that their analyses did not address. I further pointed to current practices of “replication bullying.” It appears that the logic behind these points is unclear to some people, so here are some clarifications.
The Ceiling Effect in Johnson, Cheung & Donnellan (2014)
Compared to the original data, the participants in Johnson et al.’s (2014) data gave significantly more extreme responses. With a high percentage of participants giving the highest score, it is likely that any effect of the experimental manipulation will be washed out because there is no room for participants to give a higher score than the maximal score. This is called a ceiling effect. My detailed analysis shows how a ceiling effect could account for the lack of an effect in the replication data.
Some people have wondered why my analysis determines the level of ceiling across both the neutral and clean conditions. The answer is that one has to use an unbiased indicator of ceiling that takes into account the extremity of responses in both conditions. In principle, any of the following three possibilities can be true: The percentage of extreme responses is a) unrelated to the effect of the manipulation, b) positively correlated with the effect of the manipulation, or c) negatively correlated with the effect of the manipulation. In other words, this ceiling indicator takes into account the fact that it could go either way: Having many extreme scores could help or hinder in terms of finding the effect.
Importantly, the prevalence of extreme scores is neither mentioned in the replication paper, nor do any of the reported analyses take into account the resulting severe skew of data. Thus, it is inappropriate to conclude that the replication constitutes a failure to replicate the original finding, when in reality the chance of finding an effect was compromised by the high percentage of extreme scores.
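To make the logic of a ceiling effect concrete, here is a minimal sketch in Python. It uses made-up numbers, not the original or the replication data: the latent judgments, the one-point difference between conditions, and the helper names (observe, share_at_ceiling, summarize) are all assumptions chosen purely for illustration. The same underlying difference is recorded once on a scale with room at the top and once on a scale that cuts responses off at 7, and the share of maximal responses is computed across both conditions.

```python
from statistics import mean

def observe(latent_scores, scale_max):
    """Record latent judgments on a scale that cannot exceed scale_max."""
    return [min(score, scale_max) for score in latent_scores]

def share_at_ceiling(responses, scale_max):
    """Proportion of responses sitting at the highest possible score."""
    return sum(1 for r in responses if r == scale_max) / len(responses)

# Hypothetical latent severity judgments (illustration only):
# the neutral condition is exactly 1 point harsher than the clean condition.
clean_latent = [5, 6, 7, 8, 9]
neutral_latent = [score + 1 for score in clean_latent]

def summarize(scale_max):
    clean = observe(clean_latent, scale_max)
    neutral = observe(neutral_latent, scale_max)
    return {
        "difference": mean(neutral) - mean(clean),
        # Ceiling indicator computed across BOTH conditions
        "share_at_ceiling": share_at_ceiling(clean + neutral, scale_max),
    }

wide = summarize(scale_max=10)   # room at the top of the scale
narrow = summarize(scale_max=7)  # responses pile up at the maximum

print(wide)
print(narrow)
```

In this toy case the observed difference shrinks from 1 point to about 0.4 points once 70% of all responses sit at the maximum, even though the underlying difference has not changed.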
The Replication Authors’ Rejoinder
The replication authors responded to my commentary in a rejoinder. It is entitled “Hunting for Artifacts: The Perils of Dismissing Inconsistent Replication Results.” In it, they accuse me of “critiquing after the results are known,” or CARKing, as Nosek and Lakens (2014) call it in their editorial. In the interest of “increasing the credibility of published results,” interpretation of data evidently needs to be discouraged at all costs, which is why the special issue editors decided to omit any independent peer review of the results of all replication papers.
In their rejoinder Johnson et al. (2014) make several points:
First, they note that there was no a priori reason to suspect that my dependent variables would be inappropriate for use with college students in Michigan. Indeed, some people have suggested that it is my fault if the items did not work in the replication sample. Apparently there is a misconception among replicators that first, all materials from a published study must be used in precisely the same manner as in the original study, and second, as a consequence, that “one size fits all”, and all materials have to lead to the identical outcomes regardless of when and where you test a given group of participants.
This is a surprising assumption, but it is widespread: The flag-priming study in the “Many Labs” paper involved presenting participants in Italy, Poland, Malaysia, Brazil and other countries with the American flag, and asking them questions such as what they thought about President Obama, and the United States’ invasion of Iraq. After all, these were the materials that the original authors had used, so the replicators had no choice. Similarly, in another component of “Many Labs” the replicators insisted on using an outdated scale from 1983, even though the original authors cautioned that this makes little sense, because, well, times have changed. Considering that our field is about the social and contextual factors that shape all kinds of phenomena, this is a curious way to do science indeed.
Second, to “fix” the ceiling effect, in their rejoinder Johnson et al. (2014) remove all extreme observations from the data. This unusual strategy was earlier suggested to me by editor Lakens in his editorial correspondence and then implemented by the replication authors. It involves getting rid of a huge portion of the data: 38.50% from Study 1, and 44.00% of data in Study 2.
Even if sample size is still reasonable, this does not solve the problem: The ceiling effect indicates response compression at the top end of the scale, where variability due to the manipulation would be expected. If you get rid of all extreme responses then you are throwing away exactly that portion of the data where the effect should occur in the first place. I have yet to find a statistician or a textbook that advocates removing nearly half of the data as an acceptable strategy to address a ceiling effect.
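The point about discarding extreme responses can be illustrated with a small Python sketch. Again, these are invented numbers and not the replication data: the two lists are hypothetical responses that would result from a one-point latent difference between conditions recorded on a 7-point scale, and drop_ceiling is a made-up helper name.

```python
from statistics import mean

SCALE_MAX = 7  # assumed top of the response scale in this toy example

def drop_ceiling(responses, scale_max=SCALE_MAX):
    """Remove every response that sits at the scale maximum."""
    return [r for r in responses if r < scale_max]

# Hypothetical observed responses (illustration only). They correspond to
# latent scores 3..9 (clean) and 4..10 (neutral) cut off at 7.
clean = [3, 4, 5, 6, 7, 7, 7]
neutral = [4, 5, 6, 7, 7, 7, 7]

full_difference = mean(neutral) - mean(clean)

clean_kept = drop_ceiling(clean)
neutral_kept = drop_ceiling(neutral)
trimmed_difference = mean(neutral_kept) - mean(clean_kept)

kept = len(clean_kept) + len(neutral_kept)
share_dropped = 1 - kept / (len(clean) + len(neutral))

print(f"difference with all data:      {full_difference:.3f}")
print(f"difference after trimming:     {trimmed_difference:.3f}")
print(f"share of responses discarded:  {share_dropped:.0%}")
```

In this toy case, trimming discards half of the responses and the remaining difference (0.5) is no closer to the latent one-point difference than the compressed difference in the full data (about 0.57). Other made-up numbers would give other values; the sketch only shows that removing maximal scores does not undo the compression.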
Now, imagine that I, or any author of an original study, had done this in a paper: It would be considered “p-hacking” if I had excluded almost half my data to show an effect. No reviewer would have approved of such an approach. By the same token, excluding almost half the data to show the absence of an effect should be considered “reverse p-hacking.” But the replication authors are allowed to do so and claim that there “still is no effect.”
Third, the replication authors selectively analyze a few items regarding the ceiling effect because they consider them “the only relevant comparisons.” But it does not make sense to cherry-pick single items: The possible ceiling effect can be demonstrated in my analyses that aggregate across all items. As in their replication paper, the authors further claim that only selected items on their own showed an effect in my original paper. This is surprising considering that it has always been standard practice to collapse across all items in an analysis, rather than making claims about any given individual item. Indeed, there is a growing recognition that rather than focussing on individual p-values one needs to instead consider overall effect sizes (Cumming, 2012). My original studies have Cohen’s d of .61 (Experiment 1), and .85 (Experiment 2).
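For readers unfamiliar with these terms, here is a short Python sketch of what collapsing across items and computing an overall effect size involves. The ratings are invented for illustration and are unrelated to the values of d reported above; the function names are assumptions, and the pooled-standard-deviation formula is the common textbook version of Cohen’s d, which may differ in detail from the computation in any particular paper.

```python
from math import sqrt
from statistics import mean, variance

def composite_scores(item_ratings):
    """Collapse across items: one mean score per participant."""
    return [mean(ratings) for ratings in item_ratings]

def cohens_d(group_a, group_b):
    """Standardized mean difference using the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_variance = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_variance)

# Hypothetical ratings (illustration only): one inner list per participant,
# one number per item.
clean_items = [[4, 6], [6, 6], [7, 7], [9, 7], [8, 10]]
neutral_items = [[6, 6], [7, 7], [9, 7], [8, 10], [10, 10]]

clean = composite_scores(clean_items)
neutral = composite_scores(neutral_items)

d = cohens_d(neutral, clean)
print(f"Cohen's d = {d:.2f}")
```

The effect size is computed on the composite score for each participant, so no single item is singled out.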
Again, imagine that I had done this in my original paper, to only selectively report the items for which there was an effect, “because they were the only relevant comparisons” and ignored all others. I would most certainly be condemned for “p-hacking”, and no peer reviewer would have let me get away with it. Thus, it is stunning that the replication authors are allowed to make such a case, and it is identical to the argument made by the editor in his emails to me.
In summary, there is a factual error in the paper by Johnson et al. (2014) because none of the reported analyses take into account the severe skew in distributions: The replication data show a significantly greater percentage of extreme scores than the original data (and the successful direct replication by Arbesfeld, Collins, Baldwin, & Daubman, 2014). This is especially a problem in Experiment 2, which used a shorter scale with only 7 response options, compared to the 10 response options in Study 1. The subsequent analyses in the rejoinder, which get rid of extreme scores or selectively focus on single items, are not sound. A likely interpretation is a ceiling effect in the replication data, which makes it difficult to detect the influence of a manipulation, and is especially a problem with short response scales (Hessling, Traxel & Schmidt, 2004). There can be other reasons why the effect was not observed, but the above possibility should have been discussed by Johnson et al. (2014) as a limitation of the data that makes the replication inconclusive.
An Online Replication?
In their rejoinder the replication authors further describe an online study, but unfortunately it was not a direct replication: It lacked the experimental control that was present in all other studies that successfully demonstrated the effect. The scrambled sentences task (e.g., Srull & Wyer, 1979) involves underlining words on a piece of paper, as in Schnall et al. (2008), Besman et al. (2013) and Arbesfeld et al. (2014). Whereas the paper-based task is completed under the guidance of an experimenter, for online studies it cannot be established whether participants exclusively focus on the priming task. Indeed, results from online versions of priming studies systematically differ from lab-based versions (Ferguson, Carter, & Hassin, 2014). Priming involves inducing a cognitive concept in a subtle way. But for it to be effective, one has to ensure that there are no other distractions. In an online study it is close to impossible to establish what else participants might be doing at the same time.
Even more importantly for my study, the specific goal was to induce a sense of cleanliness. When participants do the study online they may be surrounded by mess, clutter and dirt, which would interfere with the cleanliness priming. In fact, we have previously shown that one way to induce disgust is to ask participants to sit at a dirty desk, where they were surrounded by rubbish, and this made their moral judgments more severe (Schnall, Haidt, Clore & Jordan, 2008). So a dirty environment while doing the online study would counteract the cleanliness induction. Another unsuccessful replication has surfaced that was conducted online. If somebody had asked me whether it makes sense to induce cleanliness in an online study, I would have said “no,” and they could have saved themselves some time and money. Indeed, the whole point of the registered replications was to ensure that replication studies had the same amount of experimental control as the original studies, which is why original authors were asked to give feedback.
I was surprised that the online study in the rejoinder was presented as if it can address the problem of the ceiling effect in the replication paper. Simply put: A study with a different problem (lack of experimental control) cannot fix the clearly demonstrated problem of the registered replication studies (high percentage of extreme scores). So it is a moving target: Rather than addressing the question of “is there an error in the replication paper,” to which I can show that the answer is “yes,” the replication authors were allowed to shift the focus to inappropriate analyses that “also don’t show an effect,” and to new data that introduce new problems.
Guilty Until Proven Innocent
My original paper was published in Psychological Science, where it was subjected to rigorous peer-review. A data-detective some time ago requested the raw data, and I provided them but never heard back. When I checked recently, this person confirmed the validity of my original analyses; this replication of my analyses was never made public.
So far nobody has been able to see anything wrong in my work. In contrast, I have shown that the analyses reported by Johnson et al. (2014) fail to take into account the severe skew in distributions, and I alerted the editors to it in multiple attempts. I shared all my analyses in great detail but was told the following by editor Lakens: “Overall, I do not directly see any reason why your comments would invalidate the replication. I think that reporting some additional details, such as distribution of the ratings, the interitem correlations, and the cronbachs alpha for the original and replication might be interesting as supplementary material.”
I was not content with this suggestion of simply adding a few details while still concluding the replication failed to show the original effect. I therefore provided further details on the analyses and said the following:
“Let me state very clearly: The issues that my analyses uncovered do not concern scientific disagreement but concern basic analyses, including examination of descriptive statistics, item distributions and scale reliabilities in order to identify general response patterns. In other words, the authors failed to provide the high-quality analysis strategy that they committed to in their proposal and that is outlined in the replication recipe (Brandt et al., 2014). These are significant shortcomings, and therefore significantly change the conclusion of the paper.” I requested that the replication authors be asked to revise their paper to reflect these additional analyses before the paper goes into print, or instead, that I get the chance to respond to their paper in print. Both requests were denied.
Now, let’s imagine the reverse situation: Somebody, perhaps a data detective, found a comparable error in my paper and claims that it calls into question all reported analyses, and alerted the editors. Or let’s imagine that there was not even a specific error but just a “highly unusual” pattern in the data that was difficult to explain. Multiple statistical experts would be consulted to carefully scrutinize the results, and there would be immediate calls to retract the paper, just to be on the safe side, because other researchers should not rely on findings that may be invalid. There might even be an ethics investigation to ascertain that no potential wrong-doing has occurred. But apparently the rules are very different when it comes to replications: The original paper, and by implication, the original author, is considered guilty until proven innocent, while apparently none of the errors in the replication matter; they can be addressed with reverse p-hacking. Somehow, even now, the onus still is on me, the defendant, to show that there was an error made by the accusers, and my reputation continues to be on the line.
Multiple people, including well-meaning colleagues, have suggested that I should collect further data to show my original effect. I appreciate their concern, because publicly criticizing the replication paper brought focused attention to the question of the replicability of my work. Indeed, some colleagues had advised against raising concerns for precisely this reason: Do not draw any additional attention to it. The mere fact that the work was targeted for replication implies that there must have been a reason for it to be singled out as suspect in the first place. This is probably why many people remain quiet when somebody claims a failed replication of their work. Not only do they fear the mocking and ridicule on social media, they know that while a murder suspect is considered innocent until proven guilty, no such allowance is made for the scientist under suspicion: She must be guilty of something – “the lady doth protest too much”, as indeed a commentator on the Science piece remarked.
So, about 10 days after I publicly raised concerns about a replication of my work, I have not only been publicly accused of various questionable research practices or potential fraud, but there have been further calls to establish the reliability of my earlier work. Nobody has shown any error in my paper, and there are two independent direct replications of my studies, and many conceptual replications involving cleanliness and morality, all of which corroborate my earlier findings (e.g., Cramwinckel, De Cremer & van Dijke, 2012; Cramwinckel, Van Dijk, Scheepers & Van den Bos, 2013; Gollwitzer & Melzer, 2012; Jones & Fitness, 2008; Lee & Schwarz, 2010; Reuven, Liberman & Dar, 2013; Ritter & Preston, 2011; Xie, Yu, Zhou & Sedikides, 2013; Zhong & Liljenquist, 2006; Zhong, Strejcek & Sivanathan, 2010). There is an even bigger literature on the opposite of cleanliness, namely disgust, and its link to morality, generated by many different labs, using a wide range of methods (for reviews, see Chapman & Anderson, 2013; Russell & Giner-Sorolla, 2013).
But can we really be sure, I’m asked? What do those dozens and dozens of studies really tell us, if they have only tested new conceptual questions but have not used the verbatim materials from one particular paper published in 2008? There can only be one conclusive way to show that my finding is real: I must conduct one pre-registered replication study with identical materials, but this time using a gigantic sample. Ideally this study should not be conducted by me, since due to my interest in the topic I would probably contaminate the admissible evidence. Instead, it should be done by people who have never worked in that area, have no knowledge of the literature or any other relevant expertise. Only if they, in that one decisive study, can replicate the effect can my work be considered reliable. After all, if the effect is real, with identical materials the identical results should be obtained anywhere in the world, at any time, with any sample, by anybody. From this one study the truth will be so obvious that nobody has to even evaluate the results; the data will speak for themselves once they are uploaded to a website. This truth will constitute the final verdict that can be shouted off the roof-tops, because it overturns everything else that the entire literature has shown so far. People who are unwilling to accept this verdict may need to be prevented from “critiquing after the results are known” or CARKing (Nosek & Lakens, 2014) if they cannot live up to the truth. It is clear what is expected from me: I am a suspect now. More data is needed, much more data, to fully exonerate me. I must repent, for I have sinned, but redemption remains uncertain.
In contrast, I have yet to see anybody suggest that the replication authors should run a more carefully conceived experiment after developing new stimuli that are appropriate for their sample, one that does not have the problem of the many extreme responses they failed to report, indeed an experiment that in every way lives up to the confirmed high quality of my earlier work. Nobody has demanded that they rectify the damage they have already done to my reputation by reporting sub-standard research, let alone the smug presentation of the conclusive nature of their results. Nobody, not even the editors, asked them to fix their error, or called on them to retract their paper, as I would surely have been asked to do had anybody found problems with my work.
What is Bullying?
To be clear: I absolutely do not think that trying to replicate somebody else’s work should be considered bullying. The bullying comes in when failed replications are announced with glee and with direct or indirect implications of “p-hacking” or even fraud. There can be many reasons for any given study to fail; it is misleading to announce replication failure from one study as if it disproves an entire literature of convergent evidence. Some of the accusations that have been made after failures to replicate can be considered defamatory because they call into question a person’s professional reputation.
But more broadly, bullying relates to an abuse of power. In academia, journal editors have power: They are the gatekeepers of the published record. They have two clearly defined responsibilities: To ensure that the record is accurate, and to maintain impartiality toward authors. The key to this is independent peer-review. All this was suspended for the special issue: Editors got to cherry-pick replication authors with specific replication targets. Further, only one editor (rather than a number of people with relevant expertise, which could have included the original author, or not) unilaterally decided on the replication quality. So the policeman not only rounded up the suspects, he also served as the judge to evaluate the evidence that led him to round up the suspects in the first place.
Without my intervention to the Editor-in-Chief, 15 replication papers would have gone into print at a peer-reviewed journal without a single finding having been scrutinized by any expert outside of the editorial team. For original papers such an approach would be unthinkable, even though original findings do not nearly have the same reputational implications as replication findings. I spent a tremendous amount of time to do the work that several independent experts should have done instead of me. I did not ask to be put into this position, nor did I enjoy it, but at the end there simply was nobody else to independently check the reliability of the findings that concern the integrity of my work. I would have preferred to spend all that time doing my own work instead of running analyses for which other people get authorship credit.
How do we know whether any of the other findings published in that special issue are reliable? Who will verify the replications of Asch’s (1946) and Schachter’s (1951) work, since they no longer have a chance to defend their contributions? More importantly, what exactly have we learned if Asch’s specific study did not replicate when using identical materials? Do we have to doubt his work and re-write the textbooks despite the fact that many conceptual replications have corroborated his early work? If it were true that studies with small sample sizes are “unlikely to replicate”, as some people falsely claim, then there would be no point in trying to selectively smoke out false positives by running one-off large-sample direct replications that are treated as if they can invalidate a whole body of research using conceptual replications. It would be far easier to just declare 95% of the published research invalid, burn all the journals, and start from scratch.
Publications are about power: The editors invented new publication rules and in the process lowered the quality well below what is normally expected for a journal article, despite their alleged goal of “increasing the credibility of published results” and the clear reputational costs to original authors once “failed” replications go on the record. The guardians of the published record have a professional responsibility to ensure that whatever is declared as a “finding,” whether confirmatory, or disconfirmatory, is indeed valid and reliable. The format of the Replication Special Issue fell completely short of this responsibility, and replication findings were by default given the benefit of the doubt.
Perhaps the editors need to be reminded of their contractual obligation: It is not their job to adjudicate whether a previously observed finding is “real.” Their job is to ensure that all claims in any given paper are based on solid evidence. This includes establishing that a given method is appropriate to test a specific research question. A carbon copy of the materials used in an earlier study does not necessarily guarantee that this will be the case. Further, when errors become apparent, it is the editors’ responsibility to act accordingly and update the record with a published correction, or a retraction. It is not good enough to note that “more research is needed” to fully establish whether a finding is “real.”
Errors have indeed already become apparent in other papers in the replication special issue, such as in the “Many Labs” paper. They need to be corrected, otherwise false claims, including those involving “successful” replications, continue to be part of the published literature on which other researchers build their work. It is not acceptable to claim that it does not matter whether a study is carried out online or in the lab when this is not correct, or to present a graph that includes participants from Italy, Poland, Malaysia, Brazil and other international samples who were primed with an American flag to test the influence on Republican attitudes (Ferguson, Carter & Hassin, 2014). It is not good enough to mention in a footnote that in reality the replication of Jacowitz and Kahneman (1995) was not comparable to the original method, and therefore the reported conclusions do not hold. It is not good enough to brush aside original authors’ concerns that the replication of their work did not constitute an appropriate replication and therefore did not fully test the research question (Crisp, Miles, & Husnu, 2014; Schwarz & Strack, 2014).
The printed journal pages constitute the academic record, and each and every finding reported in those pages needs to be valid and reliable. The editors not only owe this to the original authors whose work they targeted, but also to the readers who will rely on their conclusions and take them at face value. Gross oversimplifications of which effects were “successfully replicated” or “unsuccessfully replicated” are simply not good enough when they do not even take into account whether a given method was appropriate to test a given research question in a given sample, especially when not a single expert has verified these claims. In short, replications are not good enough when they are held to a much lower standard than the very findings they end up challenging.
It is difficult not to get the impression that the format of the special issue meant that the editors sided with the replication authors, while completely silencing original authors. In my specific case, the editor even helped them make their case against me in print, by suggesting analyses to address the ceiling effect that few independent peer reviewers would consider acceptable, analyses that would have been unthinkable had I dared to present them as an original author. Indeed, he publicly made it very clear whose side he is on.
Lakens has further publicly been critical of my research area, embodied cognition, because according to him, such research “lacks any evidential value.” He came to this conclusion based on a p-curve analysis of a review paper (Meier, Schnall, Schwarz & Bargh, 2012), in which Schnall, Benton & Harvey (2008) is cited. Further, he mocked the recent SPSP Embodiment Preconference at which I gave a talk, a conference that I co-founded several years ago. In contrast, he publicly praised the blog by replication author Brent Donnellan as “excellent”; it is the very blog for which the latter has since apologized because it glorifies failed replications.
Bullying involves an imbalance of power and can take on many forms. It does not depend on your gender, skin colour, or whether you are a graduate student or a tenured faculty member. It means having no control when all the known rules regarding scientific communication appear to have been re-written by those who not only control the published record, but in the process also hand down the verdicts regarding which findings shall be considered indefinitely “suspect.”