An Experience with a Registered Replication Project

Jun 23, 2015 BY Simone Schnall

This post originally appeared here and is re-posted in its entirety below. Brent Donnellan, one of the authors of the attempted replication, has written a related post here.

Recently I was invited to be part of a “registered replication” project of my work. It was an interesting experience, which in part was described in an article in Science. My commentary describing specific concerns about the replication paper is available here.

Some people have asked me for further details. Here are my answers to specific questions.

Question 1: “Are you against replications?”

I am a firm believer in replication efforts and was flattered that my paper (Schnall, Benton & Harvey, 2008) was considered important enough to be included in the special issue on “Replications of Important Findings in Social Psychology.” I therefore gladly cooperated with the registered replication project on every possible level: First, following their request I promptly shared all my experimental materials with David Johnson, Felix Cheung and Brent Donnellan and gave detailed instructions on the experimental protocol. Second, I reviewed the replication proposal when requested to do so by special issue editor Daniel Lakens. Third, when the replication authors requested my SPSS files I sent them the following day. Fourth, when the replication authors told me about the failure to replicate my findings, I offered to analyze their data within two weeks’ time. This offer was declined because the manuscript had already been submitted. Fifth, when I discovered the ceiling effect in the replication data I shared this concern with the special issue editors, and offered to help the replication authors correct the paper before it went into print. This offer was rejected, as was my request for a published commentary describing the ceiling effect.

I was told that the ceiling effect does not change the conclusion of the paper, namely that it was a failure to replicate my original findings. The special issue editors Lakens and Nosek suggested that if I had concerns about the replication, I should write a blog; there was no need to inform the journal’s readers about my additional analyses. Fortunately Editor-in-Chief Unkelbach overruled this decision and granted published commentaries to all original authors whose work was included for replication in the special issue.

Of course replications are much needed and as a field we need to make sure that our findings are reliable. But we need to keep in mind that there are human beings involved, which is what Danny Kahneman’s commentary emphasizes. Authors of the original work should be allowed to participate in the process of having their work replicated. For the Replication Special Issue this did not happen: Authors were asked to review the replication proposal (and this was called “pre-data peer review”), but were not allowed to review the full manuscripts with findings and conclusions. Further, there was no plan for published commentaries; they were only implemented after I appealed to the Editor-in-Chief.

Various errors in several of the replications (e.g., in the “Many Labs” paper) became apparent only once original authors were allowed to give feedback. Errors were uncovered even for successfully replicated findings. But since the findings from “Many Labs” were already heavily publicized several months before the paper went into print, the reputational damage for some people behind the findings had already started well before they had any chance to review the findings. “Many Labs” covered studies on 15 different topics, but there was no independent peer review of the findings by experts in those topics.

For all the papers in the special issue the replication authors were allowed the “last word” in the form of a rejoinder to the commentaries; these rejoinders were also not peer-reviewed. Some errors identified by the original authors were not appropriately addressed so they remain part of the published record.

Question 2: “Have the findings from Schnall, Benton & Harvey (2008) been replicated?”

We reported two experiments in this paper, showing that a sense of cleanliness leads to less severe moral judgments. Two direct replications of Study 1 have been conducted by Kimberly Daubman and showed the effect (described on Psych File Drawer: Replication 1, Replication 2). These are two completely independent replications that were successfully carried out at Bucknell University. Further, my collaborator Oliver Genschow and I conducted several studies that also replicated the original findings. Oliver ran the studies in Switzerland and in Germany, whereas the original work had been done in the United Kingdom, so it was good to see that the results replicated in other countries. These data are so far unpublished.

Importantly, all these studies were direct replications, not conceptual replications, and they provided support for the original effect. Altogether there are seven successful demonstrations of the effect, using identical methods: Two in our original paper, two by Kimberly Daubman and another three by Oliver Genschow and myself. I am not aware of any unsuccessful studies, either by myself or others, apart from the replications reported by Johnson, Cheung and Donnellan.

In addition, there are now many related studies involving conceptual replications. Cleanliness and cleansing behaviors, such as hand washing, have been shown to influence a variety of psychological outcomes (Cramwinckel, De Cremer & van Dijke, 2012; Cramwinckel, Van Dijk, Scheepers & Van den Bos, 2013; Florack, Kleber, Busch & Stöhr, 2014; Gollwitzer & Melzer, 2012; Jones & Fitness, 2008; Kaspar, 2013; Lee & Schwarz, 2010a; Lee & Schwarz, 2010b; Reuven, Liberman & Dar, 2013; Ritter & Preston, 2011; Xie, Yu, Zhou & Sedikides, 2013; Xu, Zwick & Schwarz, 2012; Zhong & Liljenquist, 2006; Zhong, Strejcek & Sivanathan, 2010).

Question 3: “What do you think about ‘pre-data peer review’?”

There are clear professional guidelines regarding scientific publication practices and, in particular, about the need to ensure the accuracy of the published record by impartial peer review. The Committee on Publication Ethics (COPE) specifies the following in its Principles of Transparency and Best Practice in Scholarly Publishing: “Peer review is defined as obtaining advice on individual manuscripts from reviewers expert in the field who are not part of the journal’s editorial staff.” Peer review of only methods but not full manuscripts violates internationally acknowledged publishing ethics. In other words, there is no such thing as “pre-data peer review.”

Editors need to seek the guidance of reviewers; they cannot act as reviewers themselves. Indeed, “One of the most important responsibilities of editors is organising and using peer review fairly and wisely.” (COPE, 2014, p. 8). The editorial process implemented for the Replication Special Issue went against all known publication conventions that have been developed to ensure impartial publication decisions. Such a breach of publication ethics is ironic considering the stated goals of transparency in science and increasing the credibility of published results. Peer review is not censorship; it concerns quality control. Without it, errors will enter the published record, as has indeed happened for several replications in the special issue.

Question 4: “You confirmed that the replication method was identical to your original studies. Why do you now say there is a problem?”

I indeed shared all my experimental materials with the replication authors, and also reviewed the replication proposal. But human nature is complex, and identical materials will not necessarily have the identical effect on all people in all contexts. My research involves moral judgments, for example, judging whether it is wrong to cook and eat a dog after it died of natural causes. It turned out that for some reason in the replication samples many more people found this to be extremely wrong than in the original studies. This was not anticipated and became apparent only once the data had been collected. There are many ways for any study to go wrong, and one has to carefully examine whether a given method was appropriate in a given context. This can only be done after all the data have been analyzed and interpreted.

My original paper went through rigorous peer-review. For the replication special issue all replication authors were deprived of this mechanism of quality control: There was no peer-review of the manuscript, not by authors of the original work, nor by anybody else. Only the editors evaluated the papers, but this cannot be considered peer-review (COPE, 2014). To make any meaningful scientific contribution the quality standards for replications need to be at least as high as for the original findings. Competent evaluation by experts is absolutely essential, and is especially important if replication authors have no prior expertise with a given research topic.

Question 5: “Why did participants in the replication samples give much more extreme moral judgments than in the original studies?”

Based on the literature on moral judgment, one possibility is that participants in the Michigan samples were on average more politically conservative than the participants in the original studies conducted in the UK. But given that conservatism was not assessed in the registered replication studies, it is difficult to say. Regardless of the reason, however, analyses have to take into account the distributions of a given sample, rather than assuming that they will always be identical to the original sample.

Question 6: “What is a ceiling effect?”

A ceiling effect means that responses on a scale are truncated toward the top end of the scale. For example, if the scale had a range from 1-7, but most people selected “7”, this suggests that they might have given a higher response (e.g., “8” or “9”) had the scale allowed them to do so. Importantly, a ceiling effect compromises the ability to detect the hypothesized influence of an experimental manipulation. Simply put: With a ceiling effect it will look like the manipulation has no effect, when in reality the study was unable to test for such an effect in the first place. When a ceiling effect is present no conclusions can be drawn regarding possible group differences.
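The mechanism described above can be sketched with a small simulation. This is only an illustration on made-up data, not an analysis of the original or the replication data: the latent means, the standard deviation of 1.5, the true difference of 0.5 and the sample size are all assumptions chosen for the example, and the helper names are hypothetical.

```python
import random
import statistics

def observed_scores(latent_mean, n, rng, sd=1.5, low=1, high=7):
    """Draw latent judgments and record them on a bounded 1-7 scale.

    Anything above `high` is recorded as `high`: that is the ceiling.
    """
    scores = []
    for _ in range(n):
        latent = rng.gauss(latent_mean, sd)
        scores.append(min(high, max(low, round(latent))))
    return scores

def simulate(control_mean, effect=0.5, n=20000, seed=1):
    """Return (observed mean difference, share of responses at the top score)."""
    rng = random.Random(seed)
    control = observed_scores(control_mean, n, rng)
    # Assumed manipulation: lowers the latent judgment by `effect` points.
    treated = observed_scores(control_mean - effect, n, rng)
    diff = statistics.mean(control) - statistics.mean(treated)
    share_top = sum(s == 7 for s in control + treated) / (2 * n)
    return diff, share_top

# Same true effect in both runs; only the location on the scale differs.
mid_diff, mid_top = simulate(control_mean=4.0)          # responses mid-scale
ceiling_diff, ceiling_top = simulate(control_mean=8.0)  # responses pile up at 7

print(f"mid-scale: observed difference {mid_diff:.2f}, share at top {mid_top:.0%}")
print(f"ceiling:   observed difference {ceiling_diff:.2f}, share at top {ceiling_top:.0%}")
```

Under these assumptions the same underlying difference shows up clearly when responses sit in the middle of the scale and shrinks sharply when most responses are recorded as “7”.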

Research materials have to be designed such that they adequately capture response tendencies in a given population, usually via pilot testing (Hessling, Traxel, & Schmidt, 2004). Because direct replications use the materials developed for one specific testing situation, it can easily happen that materials that were appropriate in one context will not be appropriate in another. A ceiling effect is only one example of this general problem.

Question 7: “Can you get rid of the ceiling effect in the replication data by throwing away all extreme scores and then analysing whether there was an effect of the manipulation?”

There is no way to fix a ceiling effect. It is a methodological rather than statistical problem, indicating that survey items did not capture the full range of participants’ responses. Throwing away all extreme responses in the replication data of Johnson, Cheung & Donnellan (2014) would mean getting rid of 38.50% of data in Study 1, and 44.00% of data in Study 2. This is a problem even if sample sizes remain reasonable: The extreme responses are integral to the data sets; they are not outliers or otherwise unusual observations. Indeed, for 10 out of the 12 replication items the modal (=most frequent) response in participants was the top score of the scale. Throwing away extreme scores gets rid of the portion of the data where the effect of the manipulation would have been expected. Not only would removing almost half of your sample be considered “p-hacking,” most importantly, it does not solve the problem of the ceiling effect.
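The point about discarding extreme scores can also be illustrated with simulated data. Again, this is a sketch under stated assumptions rather than a reanalysis of the replication data: the latent means, the spread, the true difference of 0.5 and the helper name `rate` are all hypothetical, and the share of discarded responses will not match the percentages reported above.

```python
import random
import statistics

def rate(latent_mean, n, rng, sd=1.5, low=1, high=7):
    """Record simulated latent judgments on a bounded 1-7 scale."""
    return [min(high, max(low, round(rng.gauss(latent_mean, sd)))) for _ in range(n)]

rng = random.Random(7)
true_effect = 0.5   # assumed latent difference between the two conditions
n = 20000           # assumed number of responses per condition

control = rate(8.0, n, rng)
treated = rate(8.0 - true_effect, n, rng)

# "Fix" the ceiling by throwing away every response at the top of the scale.
kept_control = [s for s in control if s < 7]
kept_treated = [s for s in treated if s < 7]

dropped_share = 1 - (len(kept_control) + len(kept_treated)) / (2 * n)
trimmed_diff = statistics.mean(kept_control) - statistics.mean(kept_treated)

print(f"share of responses discarded: {dropped_share:.0%}")
print(f"difference among remaining responses: {trimmed_diff:.2f} (true effect {true_effect})")
```

In this sketch most of the data are discarded, and the difference among the remaining responses is still far smaller than the true effect built into the simulation, because the responses that carried the difference were the ones recorded at the top of the scale.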

Question 8: “Why is the ceiling effect in the replication data such a big problem?”

Let me try to illustrate the ceiling effect in simple terms: Imagine two people are speaking into a microphone and you can clearly understand and distinguish their voices. Now you crank up the volume to the maximum. All you hear is this high-pitched sound (“eeeeee”) and you can no longer tell whether the two people are saying the same thing or something different. Thus, in the presence of such a ceiling effect it would seem that both speakers were saying the same thing, namely “eeeeee”.

The same thing applies to the ceiling effect in the replication studies. Once a majority of the participants are giving extreme scores, all differences between two conditions are abolished. Thus, a ceiling effect means that all predicted differences will be wiped out: It will look like there is no difference between the two people (or the two experimental conditions).

Further, there is no way to just remove the “eeeeee” from the sound in order to figure out what was being said at that time; that information is lost forever and hence no conclusions can be drawn from data that suffer from a ceiling effect. In other words, you can’t fix a ceiling effect once it’s present.

Question 9: “Why do we need peer-review when data files are anyway made available online?”

It took me considerable time and effort to discover the ceiling effect in the replication data because it required working with the item-level data, rather than the average scores across all moral dilemmas. Even somebody familiar with a research area will have to spend quite some time trying to understand everything that was done in a specific study, what the variables mean, etc. I doubt many people will go through the trouble of running all these analyses and indeed it’s not feasible to do so for all papers that are published.

That is the assumption behind peer-review: You trust that somebody with the relevant expertise has scrutinized a paper regarding its results and conclusions, so you don’t have to. If instead only data files are made available online there is no guarantee that anybody will ever fully evaluate the findings. This puts specific findings, and therefore specific researchers, under indefinite suspicion. This was a problem for the Replication Special Issue, and is also a problem for the Reproducibility Project, where findings are simply uploaded without having gone through any expert review.

Question 10: “What has been your experience with replication attempts?”

My work has been targeted for multiple replication attempts; by now I have received so many such requests that I stopped counting. Further, data detectives have demanded the raw data of some of my studies, as they have done with other researchers in the area of embodied cognition because somehow this research area has been declared “suspect.” I stand by my methods and my findings and have nothing to hide and have always promptly complied with such requests. Unfortunately, there has been little reciprocation on the part of those who voiced the suspicions; replicators have not allowed me input on their data, nor have data detectives exonerated my analyses when they turned out to be accurate.

I invite the data detectives to publicly state that my findings lived up to their scrutiny, and more generally, to share all their findings of secondary data analyses. Otherwise only errors get reported and highly publicized, when in fact the majority of research is solid and unproblematic.

With replicators, too, this has been a one-way street, not a dialogue, which is hard to reconcile with the alleged desire for improved research practices. So far I have seen no actual interest in the phenomena under investigation, as if the goal were to declare the verdict of “failure to replicate” as quickly as possible. None of the replicators gave me any opportunity to evaluate the data before claims of “failed” replications were made public. Replications could tell us a lot about boundary conditions of certain effects, which would drive science forward, but this needs to be a collaborative process.

The most stressful aspect has not been to learn about the “failed” replication by Johnson, Cheung & Donnellan, but to have had no opportunity to make myself heard. There has been the constant implication that anything I could possibly say must be biased and wrong because it involves my own work. I feel like a criminal suspect who has no right to a defense and there is no way to win: The accusations that come with a “failed” replication can do great damage to my reputation, but if I challenge the findings I come across as a “sore loser.”

Question 11: “But… shouldn’t we be more concerned about science rather than worrying about individual researchers’ reputations?”

Just like everybody else, I want science to make progress, and to use the best possible methods. We need to be rigorous about all relevant evidence, whether it concerns an original finding, or a “replication” finding. If we expect original findings to undergo tremendous scrutiny before they are published, we should expect no less of replication findings.

Further, careers and funding decisions are based on reputations. The implicit accusations that currently come with failure to replicate an existing finding can do tremendous damage to somebody’s reputation, especially if accompanied by mocking and bullying on social media. So the burden of proof needs to be high before claims about replication evidence can be made. As anybody who’s ever carried out a psychology study will know, there are many reasons why a study can go wrong. It is irresponsible to declare replication results without properly assessing the quality of the replication methods, analyses and conclusions.

Let’s also remember that there are human beings involved on both sides of the process. It is uncollegial, for example, to tweet and blog about people’s research in a mocking tone, accuse them of questionable research practices, or worse. Such behavior amounts to bullying, and needs to stop.

Question 12: “Do you think replication attempts selectively target specific research areas?”

Some of the findings obtained within my subject area of embodied cognition may seem surprising to an outsider, but they are theoretically grounded and there is a highly consistent body of evidence. I have worked in this area for almost 20 years and am confident that my results are robust. But somehow recently all findings related to priming and embodied cognition have been declared “suspicious.” As a result I have even considered changing my research focus to “safer” topics that are not constantly targeted by replicators and data detectives. I sense that others working in this area have suffered similarly, and may be rethinking their research priorities, which would result in a significant loss to scientific investigation in this important area of research.

Two main criteria appear to determine whether a finding is targeted for replication: Is a finding surprising, and can a replication be done with few resources, i.e., is a replication feasible? Whether a finding surprises you or not depends on familiarity with the literature: the lower your expertise, the higher your surprise. This makes counterintuitive findings a prime target, even when they are consistent with a large body of other findings. Thus, there is a disproportionate focus on areas with surprising findings for which replications can be conducted with limited resources. It also means that the burden of responding to “surprised” colleagues is very unevenly distributed and researchers like myself are targeted again and again, which will do little to advance the field as a whole. Such practices inhibit creative research on phenomena that may be counterintuitive because they motivate researchers to play it safe to stay well below the radar of the inquisitors.

Overall it’s a base rate problem: If replication studies are cherry-picked simply due to feasibility considerations, then a very biased picture will emerge, especially if failed replications are highly publicized. Certain research areas will suffer from the stigma of “replication failure” while other research areas will remain completely untouched. A truly scientific approach would be to randomly sample from the entire field, and conduct replications, rather than focus on the same topics (and therefore the same researchers) again and again.

Question 13: “So far, what has been the personal impact of the replication project on you?”

The “failed” replication of my work was widely announced to many colleagues in a group email and on Twitter already in December, before I had the chance to fully review the findings. So the defamation of my work already started several months before the paper even went into print. I doubt anybody would have widely shared the news had the replication been considered “successful.” At that time the replication authors also put up a blog entitled “Go Big or Go Home,” in which they declare their studies an “epic fail.” Considering that I helped them in every possible way, even offered to analyze the data for them, this was disappointing.

At a recent interview for a big grant I was asked to what extent my work was related to Diederik Stapel’s and therefore unreliable; I did not get the grant. The logic is remarkable: if researcher X faked his data, then the phenomena addressed by that whole field are called into question.

Further, following a recent submission of a manuscript to a top journal a reviewer raised the issue about a “failed” replication of my work and therefore called into question the validity of the methods used in that manuscript.

The constant suspicions create endless concerns and second thoughts that impair creativity and exploration. My graduate students are worried about publishing their work out of fear that data detectives might come after them and try to find something wrong in their work. Doing research now involves anticipating a potential ethics or even criminal investigation.

Over the last six months I have spent a tremendous amount of time and effort on attempting to ensure that the printed record regarding the replication of my work is correct. I made it my top priority to try to avert the accusations and defamations that currently accompany claims of failed replications. Fighting on this front has meant that I had very little time and mental energy left to engage in activities that would actually make a positive contribution to science, such as writing papers, applying for grants, or doing all the other things that would help me get a promotion.

Question 14: “Are you afraid that being critical of replications will have negative consequences?”

I have already been accused of “slinging mud”, “hunting for artifacts”, “soft fraud” and it is possible that further abuse and bullying will follow, as has happened to colleagues who responded to failed replications by pointing out errors of the replicators. The comments posted below the Science article suggest that even before my commentary was published people were quick to judge. In the absence of any evidence, many people immediately jumped to the conclusion that something must be wrong with my work, and that I’m probably a fraudster. I encourage the critics to read my commentary so you can arrive at a judgment based on the facts.

I will continue to defend the integrity of my work, and my reputation, as anyone would, but I fear that this will put me and my students even more directly into the cross hairs of replicators and data detectives. I know of many colleagues who are similarly afraid of becoming the next target and are therefore hesitant to speak out, especially those who are junior in their careers or untenured. There now is a recognized culture of “replication bullying:” Say anything critical about replication efforts and your work will be publicly defamed in emails, blogs and on social media, and people will demand your research materials and data, as if they are on a mission to show that you must be hiding something.

I have taken on a risk by publicly speaking out about replication. I hope it will encourage others to do the same, namely to stand up against practices that do little to advance science, but can do great damage to people’s reputations. Danny Kahneman, who so far has been a big supporter of replication efforts, has now voiced concerns about a lack of “replication etiquette,” where replicators make no attempts to work with authors of the original work. He says that “this behavior should be prohibited, not only because it is uncollegial but because it is bad science. A good-faith effort to consult with the original author should be viewed as essential to a valid replication.”

Let’s make replication efforts about scientific contributions, not about selectively targeting specific people, mockery, bullying and personal attacks. We could make some real progress by working together.
