Computo is not just about publishing a notebook and proving that it can be compiled with CI! That compilation step is what we call “Editorial Reproducibility”. “Scientific” or “numerical” reproducibility of the analyses is also mandatory, on top of the classical peer-review evaluation.

We don’t ask people to reproduce their data… yet! We also don’t ask for “bit-wise” computational reproducibility (i.e. obtaining exactly the same results, bit for bit) but rather for “statistical” reproducibility, i.e. obtaining results that lead to the same conclusions, allowing for non-significant statistical variability.
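As a purely illustrative sketch of this distinction, in Python (the numbers and the tolerance below are made up, and the acceptable tolerance depends entirely on the analysis):

```python
import numpy as np

# Hypothetical numbers: a reference estimate from the paper and a re-run.
reference = np.array([0.302, 1.498, -0.751])
rerun = np.array([0.301, 1.499, -0.752])

# Bit-wise reproducibility would require exact equality:
print(np.array_equal(reference, rerun))  # False: the runs are not bit-identical

# Statistical reproducibility only requires agreement up to a tolerance
# that leaves the scientific conclusion unchanged:
np.testing.assert_allclose(reference, rerun, atol=1e-2)  # passes
```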

Reproducible Workflow

Indeed, the global scientific workflow of a reproducible Computo submission may be split into two types of steps:

External
This part of the process may be conducted outside of the notebook environment, for several reasons (not mutually exclusive):
  • the process takes too long to be conducted in a notebook
  • the data to be processed are too large to be handled directly in the notebook
  • it needs a specific environment (e.g. a cluster with GPUs)
  • it involves specific languages (e.g. C, C++, Fortran) or build tools (e.g. make, cmake)

This is “Computational reproducibility”: reproducibility is achieved by providing the code and the environment needed to run it, but not the results themselves.
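For instance, the external step could be a plain script run on the dedicated infrastructure, whose only contract with the notebook is the intermediate-results file it writes. A minimal sketch in Python, assuming a hypothetical simulation (all names and paths are illustrative):

```python
# run_simulation.py -- hypothetical external step, run e.g. on a cluster
import numpy as np

def long_running_simulation(seed: int, n: int = 1_000_000) -> np.ndarray:
    """Stand-in for a computation too heavy to run inside the notebook."""
    rng = np.random.default_rng(seed)
    return rng.normal(size=n)

if __name__ == "__main__":
    samples = long_running_simulation(seed=42)
    # The intermediate results written here are what gets transferred
    # to the notebook environment in the next step.
    np.save("simulation_results.npy", samples)
```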

Editorial
This is where the notebook presents the results of the external process, and where everything is put together to produce the final document. It is “Direct reproducibility” in the sense that the notebook is the only thing needed to reproduce the results.
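Continuing the sketch above, the editorial counterpart could be a notebook cell that simply loads the stored results and presents them (file and variable names are again illustrative):

```python
# Notebook cell (editorial step): only the transferred intermediate
# results are needed, not the external computing environment.
import numpy as np

samples = np.load("simulation_results.npy")  # produced by the external step
print(f"n = {samples.size}, mean = {samples.mean():.3f}, sd = {samples.std():.3f}")
```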

Ultimately, the workflow must end with a direct reproducibility step that concludes the whole process.

When applicable, the switch from external to editorial reproducibility is done with a “data transfer” step, where the data produced by the external process is transferred to the notebook environment. Not only must the intermediate results be provided, but also the code to transfer them into the notebook environment. There are a variety of software solutions to do so.

Examples of data transfer solutions

Intermediate results storage

  • in a Python environment: the joblib.Memory class, which provides a caching mechanism for Python functions and can be used to save the results of a function call to disk and load them back later (see the sketch after this list).
  • in an R environment: the .RData file format; objects saved with save() can be loaded back in R with the load() function.
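A minimal sketch of the joblib.Memory mechanism (the cache directory and function below are hypothetical):

```python
from joblib import Memory

# Cache directory; committing it to the repository makes the cached
# results available in the notebook environment (see next section).
memory = Memory("./joblib_cache", verbose=0)

@memory.cache
def expensive_analysis(param: float) -> float:
    """Stand-in for a costly step of the analysis."""
    return param ** 2  # imagine hours of computation here

# The first call computes and stores the result on disk; later calls
# (e.g. in the notebook) load it back from the cache instead of recomputing.
result = expensive_analysis(3.0)
```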

Transfer of the results to the notebook environment

  • for both aforementioned solutions, the results (the joblib cache directory or the .RData file) can be committed to the git repository and loaded directly in the notebook environment.
  • another solution is to centralize the input data (when too large to be committed) and the intermediate results on a shared scientific data repository (we recommend Zenodo for this purpose), and download them in the notebook environment, as sketched below.
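For the Zenodo route, a minimal download sketch using only the Python standard library (the record URL and file name are placeholders, not a real deposit):

```python
import urllib.request
from pathlib import Path

# Placeholder Zenodo URL: replace with the file URL of your own deposit.
URL = "https://zenodo.org/record/1234567/files/simulation_results.npy"
local_file = Path("simulation_results.npy")

# Download the intermediate results only if they are not already present,
# so re-rendering the notebook does not re-download them every time.
if not local_file.exists():
    urllib.request.urlretrieve(URL, local_file)
```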