Reproducible Research: Can you duplicate the study and results you reported…15 years ago?
Link to the last RSS article here: lavaan: An Open Source Structural Equation Modeling Package Using the R System for Statistical Modeling. -- Ed.
By Dr. Jon Starkweather, Research and Statistical Support Consultant
This month’s article concerns a topic that is often overlooked. It was inspired by a short course presented by Harrell (2012a) at the 5th Annual Bayesian Biostatistics Conference, which RSS staff attended. The article offers some direction and tips for conducting and reporting research in a way that allows the results to be duplicated at any time in the future. Essentially, the term reproducible research means just that: the research can be duplicated, exactly, at any time in the future. Reproducibility is one of the core principles of science and empirical decision making. Are the results that guide our decisions reliable? In other words, can the results be consistently reproduced with other data and, even more importantly, can they be reproduced with the same data that originally produced them? If the results of a particular study cannot be replicated, then those findings become suspect. Below we offer some practical suggestions to help researchers produce results and reports that can be reproduced in the future.
Use the Right Stuff and Sow the Seed
Many kinds of stuff are used in research. First there is the apparatus, a virtually limitless list of objects such as surveys, Bunsen burners, particle accelerators, generators, and chemicals. Obviously, these objects should be used only when they can reasonably be expected to be reliable, but that is not our concern in this article. The stuff we are concerned with here is software. If the software you use for statistical computation cannot exactly reproduce a statistical estimate, then you are using the wrong software for statistical computation.
Given the rapid development of relatively cheap computers and the parallel evolution of increasingly sophisticated statistical analyses, it is reasonable to expect a certain level of complexity in the research one conducts. For example, resampling techniques (e.g., bootstrapping) and Markov chain Monte Carlo (MCMC) methods are often used; in either case, it is important that the pseudo-random process(es) be reproducible. This may at first seem to be a contradiction; however, most software capable of these procedures is also capable of seeding (indexing) the random number generator so that the results can be replicated. Therefore, it is important to understand how your software generates random numbers and how to set the seed for a particular analysis or result. For example, it is common to use the ‘set.seed’ function in R to seed the random number generator. Below, we use the ‘sample’ function to randomly sample (without replacement) from a vector (x) of sequential values from 1 to 15, with the date (20121010; October 10, 2012) as the ‘seed’ in the ‘set.seed’ function.
As can be seen in the above image, four samples (each of size n = 3) are drawn at random from the vector x, and the values differ each time. Then four samples are drawn after setting the seed (to the same value each time), and the values are the same in every sample.
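The example described above can be sketched in R roughly as follows. This is a reconstruction from the description in the text (x holds 1 to 15, each sample has size 3, the seed is 20121010), not the original code; the object names ‘unseeded’ and ‘seeded’ are illustrative assumptions.

```r
# Vector of sequential values from 1 to 15, as described in the text.
x <- 1:15

# Four draws of size 3 without setting a seed first:
# these will generally differ from one another and from run to run.
unseeded <- replicate(4, sample(x, size = 3, replace = FALSE),
                      simplify = FALSE)

# Four draws, resetting the seed to the same value before each one:
# every draw returns the same three values.
seeded <- lapply(1:4, function(i) {
  set.seed(20121010)
  sample(x, size = 3, replace = FALSE)
})

print(unseeded)
print(seeded)
```

One caveat worth noting: the same seed is only guaranteed to give the same values under the same random number generator settings. R 3.6.0 changed the default method used by ‘sample’, so it is prudent to record the R version (for example with ‘sessionInfo()’) alongside the seed.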
Written in Stone
Not only should a researcher be concerned with reproducing exact results (statistics); the researcher should also be concerned with reproducing the report of those results. Common word processing software is convenient: it makes it easy to produce a formatted document quickly. However, documents created with common word processing software often cannot be read reliably by multiple collaborators, colleagues, or users on different computers (i.e., different operating systems). Although this area of software has improved drastically in the last ten years, there are often still differences between the same document produced, or even viewed, on different operating systems, even when using the same word processing software (see Goldberg, 2005). For this reason, and because it offers integration with R, it is recommended that reports be generated using TeX/LaTeX (Knuth, 1995; see also the Wikipedia TeX article for a description). A report written in TeX can incorporate statistical programming code, graphics, and comments using various packages in R (Kuhn, 2012) and various packages in TeX/LaTeX. Furthermore, a TeX document can be processed on any computer (i.e., any operating system) using any of several TeX-based editors, and the resulting document will appear and print exactly the same way (e.g., it will look the same in Adobe, Ghostscript, etc., regardless of operating system).
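One way to combine R code with a TeX report is Sweave, which ships with R in the ‘utils’ package. The sketch below is only an illustration of the idea, not the workflow used for this article: the file name ‘report.Rnw’, the chunk label ‘simulate’, and the simulated data are all hypothetical.

```r
# Work in a fresh temporary directory so nothing is written elsewhere.
workdir <- tempfile("sweave_demo")
dir.create(workdir)

# A minimal Sweave source file: LaTeX text with one embedded R chunk.
# The file name, chunk label, and data are hypothetical.
rnw_lines <- c(
  "\\documentclass{article}",
  "\\begin{document}",
  "<<simulate, echo=TRUE>>=",
  "set.seed(20121010)",
  "y <- rnorm(10)",
  "@",
  "The sample mean is \\Sexpr{round(mean(y), 2)}.",
  "\\end{document}"
)
rnw_file <- file.path(workdir, "report.Rnw")
writeLines(rnw_lines, rnw_file)

# Sweave() runs the R chunk and writes report.tex into the
# current working directory, so switch there temporarily.
old_wd <- setwd(workdir)
utils::Sweave(rnw_file)
setwd(old_wd)

tex_file <- file.path(workdir, "report.tex")
file.exists(tex_file)
```

The resulting ‘report.tex’ contains the code, its output, and the computed mean in place of the ‘\Sexpr{}’ expression. Turning it into a PDF requires a TeX installation, which this sketch does not assume.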
Another thing to remember when conducting statistical analysis (or any type of programming) is that the syntax, code, or script should be easily understood by anyone who is likely to see it. In other words, while programming, you should include as many comments as necessary to make the code understandable to yourself and anyone else for the foreseeable future. Imagine trying to reproduce your current research twenty years from now: will you remember why you recoded that variable, or why you used a particular missing value imputation technique? In essence, always use frequent, copious, descriptive, and intuitive comments in your code (or syntax). This recommendation is not aimed only at R users; researchers who use SPSS (or SAS) should also make a habit of using syntax, even if the analysis requires only pointing and clicking through menus. Syntax is needed because menu options often change over time, and the syntax will help people in the future decipher exactly what was done and why, especially if copious comments accompany the working syntax or code. Other benefits of using syntax or code are that it preserves the order of what was done, and that the comments help inform or guide the writing of the formal report later.
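As a small illustration of the kind of commenting meant here, consider the sketch below. The data frame ‘survey’, its variables, and the missing-value code 999 are invented for the example; the point is that each comment records why a step was taken, not just what it does.

```r
# Hypothetical survey data; names and values are illustrative only.
# In this made-up data set, 999 was the data-entry code for "refused".
survey <- data.frame(
  id  = 1:6,
  age = c(23, 35, NA, 51, 999, 42)
)

# Recode 999 to NA: it is a missing-value code, not a real age,
# and would otherwise inflate the mean and variance of age.
survey$age[which(survey$age == 999)] <- NA

# Impute missing ages with the median of the observed ages.
# The median is used (rather than the mean) because it is less
# sensitive to a few extreme ages in a small sample.
age_median <- median(survey$age, na.rm = TRUE)
survey$age_imputed <- ifelse(is.na(survey$age), age_median, survey$age)

# Record the R version and loaded packages with the results,
# so the analysis can be matched to the software that produced it.
print(sessionInfo())
```

Keeping the original ‘age’ column next to ‘age_imputed’ is a deliberate choice in this sketch: it lets a later reader see exactly which values were imputed.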
Make It Available
Finally, scientific results should not be accessible only to those fortunate enough to afford journal subscription fees or access to libraries. To borrow from Stewart Brand (1987): “information wants to be free” (p. 202). Your report, including the data and code, should be available upon request, if not freely available on the web. If proprietary data or apparatus were used, provide links and references to the parties who own the rights to those materials. Part of the value of scientific results comes from scientists’ (e.g., data analysts, graduate students, faculty, professional researchers, etc.) ethical responsibility to allow critical review and scrutiny of their research. Without candid acknowledgment of limitations and the ability to verify findings, science becomes no more informative than rumor or speculation.
Until next time, all the leaves are brown…
References & Resources
Baggerly, K. A., & Berry, D. A. (2011). Reproducible Research. AMSTATnews (Column of American Statistical Association). Available at: http://magazine.amstat.org/blog/2011/01/01/scipolicyjan11/
Brand, S. (1987). The Media Lab: Inventing the Future at MIT. New York: Viking Press. Available at: Eagle Commons Library, Call Number: T171.M49 B73 1987 which can be accessed through the UNT library: http://www.library.unt.edu/
Gentleman, R., & Lang, D. T. (2004). Statistical Analyses and Reproducible Research. Bioconductor Project Working Papers. Working Paper 2. Available at: http://biostats.bepress.com/bioconductor/paper2/
Goldberg, J. (2005). MS-Word is not a document exchange format. Available at: http://goldmark.org//netrants/no-word/attach.html
Harrell, F. (2012a). Reproducible research. A short course given at the 5th Annual Bayesian Biostatistics Conference. Presentation available at: http://www.mdanderson.org/education-and-research/departments-programs-and-labs/departments-and-divisions/division-of-quantitative-sciences/pdf/frank-e-harrell-jr.pdf
Harrell, F. (2012b). Statistical Reporting. Department of Biostatistics. Vanderbilt University. Resource content available at: http://biostat.mc.vanderbilt.edu/wiki/Main/StatReport
Hollister, J. W., & Walker, H. A. (2007). Beyond data: Reproducible research in ecology and environmental sciences. Frontiers in Ecology and the Environment, 5(1), 11 – 12. Available at JSTOR: http://www.jstor.org/
Koenker, R., & Zeileis, A. (2009). On reproducible econometric research. Journal of Applied Econometrics, 24, 833 – 847. Available through JSTOR: http://www.jstor.org/
Kuhn, M. (2012). Comprehensive R Archive Network (CRAN) Reproducible Research. CRAN Task View. Available at: http://cran.r-project.org/web/views/ReproducibleResearch.html
Mesirov, J. P. (2010). Accessible Reproducible Research. Science, 327, 415 – 416. Available at www.sciencemag.org
Wikipedia. (2012). TeX. Available at: http://en.wikipedia.org/wiki/TeX