Independent verification of published results is a cornerstone of the scientific process. In this era of genomic science, an equally important criterion is "reproducibility": whether independent scientists can replicate published analyses and results from the original data [1]. Reproducibility is discussed in detail in the July issue of Biostatistics [2].

The need for reproducibility was recently highlighted by an episode at Duke University, in which three clinical trials were conducted using genomic "signatures" to choose patients' cancer therapy. Scientists raised questions about these signatures after independent re-analyses failed to reproduce them and identified severe errors. The trials were suspended last October but restarted in February, albeit with no means for independent assessment of reproducibility. Only after an outcry in mid-July over a principal investigator's alleged résumé padding were the trials once again halted and the reproducibility of that investigator's work reconsidered. This is not how science should work.

The re-analyses of the Duke signatures required thousands of hours of work, because the information accompanying the papers published in Nature Medicine, the New England Journal of Medicine, and The Lancet Oncology, among others, was insufficient. Unfortunately, this is not uncommon. In a survey of eighteen quantitative microarray papers, Ioannidis et al. [3] were able to reproduce the results exactly for only two. For fully half, they declared reproducibility impossible because of a lack of adequate information.

To fix this problem, journals should require authors to submit more complete data and information supporting the conclusions in the paper, in sufficient detail to allow independent assessment of the validity and reproducibility of the results. Specifically, we recommend that the following be required:

1. Primary Data. Ideally, the entire set of data used to derive the conclusions, with adequate documentation and sample annotation.
2. Provenance. Primary data sources (database accessions or appropriate URLs).
3. Software Code. Ideally, all scripts used in the analyses, along with any instructions necessary to run the code.
4. Analytical Protocols. Step-by-step descriptions of all non-scriptable steps.
5. Research Protocol. Pre-specified research plans, if they exist.

Fuller descriptions of these elements are provided at http://groups.google.com/group/reproducible-research/files?hl=en.

We recognize that there are situations, such as protecting patient confidentiality, in which not all of the data or code can be supplied. In those cases, however, the authors should justify the omission and describe the alternative steps taken to ensure independent reproducibility. While this issue is complex, we see these recommendations as important first steps. We know that they will require extra effort from authors, but we believe the effort is worthwhile. As a community, we owe it to our colleagues, to patients, and to the public to ensure the validity of our work.

Sincerely,
1. Peng RD. Reproducible research and Biostatistics. Biostatistics 2009; 10:405-408.
2. Diggle PJ, Zeger SL. Editorial. Biostatistics 2010; 11:375.
3. Ioannidis JP, et al. Repeatability of published microarray gene expression analyses. Nat Genet 2009; 41:149-155.