Workshop on Reproducible Research

A hands-on introduction to quarto with a Computo submission

Julien Chiquet

INRAE, MIA-Paris

Pierre Neuvial

CNRS, IMT

Université de Toulouse

Ghislain Durif

CNRS

ENS Lyon, LBMC

Tanguy Lefort

Université de Montpellier, IMAG

CNRS, Inria, LIRMM

François-David Collin

CNRS

IMAG

May 31, 2024

Planning

  • 9h-10h: Introduction to Computo and Quarto
  • 10h-10h30: Coffee break/Discussion
  • 10h30-11h30: Hands-on with a toy example
  • 11h30-12h30: Follow-up and optional personal article submission

Learning objectives

  • Understand the benefits of reproducible research
  • Learn how to create a quarto document
  • Learn how to include code, data, and narrative text in a quarto document
  • Learn how to submit a quarto document to Computo
  • How to navigate the Computo submission process (optional)

Short introduction to Computo and quarto

Team

Editorial board

IT support

Julien Chiquet (chief editor)

Stat. learning, DR INRAE
Paris-Saclay University

Pierre Neuvial

Statistics, DR CNRS
IMT Toulouse

Mathurin Massias

Optim./Machine-Learning
CR INRIA Lyon

Fra.-Dav. Collin

CS/Stats/ML, IR CNRS
IMAG, Montpellier University

Nelle Varoquaux

Machine learning, CR CNRS
Grenoble Alpes University

Marie-Pierre Étienne

Statistics, MCF
Institut Agro Rennes-Angers

Chloé Azencott

Machine Learning
CR MinesParisTech

Ghislain Durif

Stats/ML/dev, IR CNRS
LBMC, ENS LYON

What is reproducible research?

Fundamentally, it provides three things:

Tools to reproduce the results (that’s like cooking)

A “recipe” to reproduce the results (still like cooking)

A path to understand the results and the process that led to them (unlike cooking…)

Pre-Computo era

The pdf era and paper submission.

Reproducibility was not a priority:

  • Tools had to be bought, installed, and maintained
  • Data and code were not shared (social engineering)
  • Even methodological details were often missing

Pre-Computo era (2)

And then in the Machine Learning domain, there was distill.pub [1]

  • state-of-the-art visualizations
  • paradigm shift in the scientific publication: “distillation” of complex ideas
  • 100% reproducible (just a git clone and a few standard commands)

but…

Pre-Computo era (3)

 

The engineering was too complex for the average scientist (a lot of JavaScript, etc.)

In fact, the distill.pub project was discontinued in 2021 [2]


The Rise of the Pragmatic

distill.pub’s goals were right, but they outpaced themselves in terms of development complexity.

  • Computo is a fresh start with a pragmatic approach
  • Leverage what the scientific community is already using (R Markdown, Jupyter notebooks, etc.)

\(\Rightarrow\) bring the community to higher standards


Origin of Computo (~ 2020s)

The French Statistical Society (SFdS) appoints a “publication” committee (led by Julien, then Pierre) to develop a new journal

Assessment

  • 😔 Multiplication of “traditional” journals…
  • 😔 No recognition of “negative” results
  • 😥 No or not enough recognition of source code and case studies
  • 😱 Decline in publication quality and in the time dedicated to each article (on both the author and reviewer sides) [3]
  • 😱 Issues with scientific reproducibility (analyses, experiments) [4–9]

Point of view

  • Need for renewal in how scientific research is carried out
  • Need for higher standards in the publication of results

⇝ Emergence of “Computo” idea

Philosophy

Scientific perimeter

Promote contributions in statistics and machine learning that provide insight into which models or methods are more appropriate to address a specific scientific question

Open access

  • “Diamond” open access (free to publish and free to read, possible to reuse)
  • 🅭 🅯 Content published under CC-BY license (attribution, share, adapt)
  • Reviews and discussions available after acceptance for publication (anonymous reviews)

⇝ In accordance with Budapest Open Access Initiative (BOAI) and Plan S

Reproducible

  • Numerical (statistical) reproducibility is a necessary condition
  • Source code and data should be available, at least partly executed and fully executable

Note on reproducible research [10–12]

Why reproducing scientific results?

  • To strengthen their credibility
  • To check for errors (everyone makes errors at some point!)
  • To build new research upon them (science is incremental)

Issues?

  • Reproducing numerical scientific results is often difficult (evolving technology and environments; source code, environment configuration, or software only partially available or not available at all)
  • Waste of time and resources to reproduce existing non-reproducible results

Reproducible research?

  • For others but also for your future self
  • Improve result credibility
  • Facilitate future research works

Setup

Official launch at the end of 2021

“Economic” model

  • A few tenacious people…
  • Free/Open-source community tools (Pandoc, Quarto, Git forge)
  • Institutional support (INRAE, INRIA, CNRS, SFdS)

Functioning

Writing system

Notebook and literate programming
text (markdown) + math (\(\LaTeX\)) + code (Python/R/Julia), references (bib\(\TeX\))

Publication system

Environment management, Compilation, Multi-format publication (pdf, html)
Continuous integration/Continuous deployment (CI/CD)

Reviewing system

  • Anonymous exchange published after acceptance
  • Reviewer pool (you can join)
  • [Ongoing switch from Scholastica to OpenReview]

Solutions/Prototype

Reproducible article and computations

Automatic editorial reproducibility

Scientific validation

Note on literate programming


  • Literate programming [13]: notebook including text and code
  • Markup formatting language: e.g. markdown
  • Separate content from rendering (≠ “what you see is what you get” editors)
  • Rendering includes text, code and results (from code computations)


---
title: "My article"
---

We compute 1+1:

```{r}
1+1
```

Note on quarto



  • Generalization of R Markdown
  • Relying on top community tools like universal document converter Pandoc
  • Developed and supported by RStudio/Posit
  • Native support of complex documents (website, articles, books) and multiple languages for computations (R, Python, Julia)
  • Management of references, citations, figures, tables, metadata, etc.

Note on continuous integration

  • Implementation in git forges (e.g. github actions or gitlab CI/CD)
  • Triggered by commits
  • Automatic tests
  • Automatic deployment: package/software publication, website

Credit: Pratik89Roy CC-BY-SA-4.0 from Wikimedia

Tools for authors

Document model

quarto Computo extension

Document template

Git template repository

with template notebook document + doc + pre-configured compilation and publication setup

Locally

  • Text editor/IDE (VS Code, RStudio, NeoVim, etc.)
  • Quarto (compilation)
  • Julia / R / Python code + computations
  • git versioning system

Author point of view (1/3)


Step 0: set up a git repository for your article

Startup from a template repository (R, Python, Julia)

Tip

You can host your git repository on GitHub and, soon, on any GitLab forge.


Step 1: write your article

Let’s go, locally (same spirit as Jupyter/Rmarkdown notebooks)

Author point of view (2/3)

Step 2: configure the environment (dependencies management)

venv: use a virtual environment and generate the requirements.txt file

# requirements.txt
jupyter
matplotlib
numpy
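For long-term reproducibility it can help to pin the exact versions installed in the virtual environment. The following is only a minimal sketch, not part of the Computo template: the helper names `pinned_requirements` and `write_requirements` are hypothetical, and the code assumes it runs inside the activated virtual environment. It relies only on the standard library (`importlib.metadata`).

```python
from importlib import metadata
from pathlib import Path


def pinned_requirements(packages):
    """Return one requirement line per package, pinned to the installed version.

    Packages that are not installed in the current environment are kept
    unpinned so the caller can spot them.
    """
    lines = []
    for name in packages:
        try:
            version = metadata.version(name)
        except metadata.PackageNotFoundError:
            lines.append(name)  # not installed here: leave unpinned
        else:
            lines.append(f"{name}=={version}")
    return lines


def write_requirements(packages, path="requirements.txt"):
    """Write the pinned requirement lines to `path` (hypothetical helper)."""
    Path(path).write_text("\n".join(pinned_requirements(packages)) + "\n", encoding="utf-8")
```

For instance, `write_requirements(["jupyter", "matplotlib", "numpy"])` would produce a file with the same three packages as above, each followed by `==` and the version found in the environment.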

renv: generate the renv.lock file

renv::init()
renv::install("ggplot2")
# or equivalently install.packages("ggplot2")
renv::snapshot()

Pkg: native Julia package manager (with generated Project.toml and Manifest.toml files)

add Plots
add IJulia

The configuration file is versioned and used during the CI compilation/publication action

Step 3: (re)production

A git push command triggers the compilation of your article (including computations) and its publication as a GitHub page

See the preconfigured .github/workflows/build.yml file for the GitHub Actions configuration

Author point of view (3/3)


Step 4: submit your article

If the CI process succeeds, both HTML and PDF versions are published on the GitHub page associated with the repository



OpenReview (replacing Scholastica)

https://openreview.net/group?id=Computo

Submit:

  • your article PDF (scientific content review)
  • your git repository (source code and reproducibility review)

Editor point of view

After a “traditional” review process, a three-step procedure:

  1. Acceptance
  2. Pre-production
  3. Publication in Computo (with a DOI)

including

  • Copy of the author git repository to https://github.com/computorg/
  • Final version formatting
  • Review report publication
  • Registration in the journal bibliographic database
  • Copy of the repository to Software Heritage for archiving
  • Publication of the article on the journal website

Two-and-a-half-year report


🥲 Fully operational + DOI, ISSN

🙂 7 published articles, 3 in pre-production, 6 under review (more details here)

🙂 x presentations (Montpellier, Toronto, Humastica, Grenoble, RR2023, etc.)

🙂 French reproducible research network

🤯 Difficult to find reviewers

🤔 Institutional support?

🤔 Changing of practices in the scientific community?


Discussion

About several choices

  • quarto: dynamic, language-agnostic, FOSS, community-based (Pandoc), RStudio/Posit support
  • GitHub: dynamic, large user community, but not institutional and with limited computing resources

Comparison/inspiration

Perspectives


  • Provision of computing resources (to be able to run all computations)
  • Full gitlab support (CI/CD, docker, registry, etc.)
  • Switch to a French institutional GitLab forge?
  • Improve the long-term reproducibility stack (Docker container, GUIX fully reproducible environment, only at the end of the publication process)


How to help?

  • By submitting your work!
  • By becoming a reviewer

References

1. Olah, C and Carter, S 2017 Research debt. Distill. DOI: https://doi.org/10.23915/distill.00005
2. Team, E 2021 Distill hiatus. Distill. DOI: https://doi.org/10.23915/distill.00031
3. Hanson, M A, Barreiro, P G, Crosetto, P, and Brockington, D 2023 The strain on scientific publishing. DOI: https://doi.org/10.48550/arXiv.2309.15884
4. Ioannidis, J P A 2005 Why Most Published Research Findings Are False. PLoS Medicine, 2(8): e124. DOI: https://doi.org/10.1371/journal.pmed.0020124
5. Steen, R G 2011 Retractions in the scientific literature: Is the incidence of research fraud increasing? Journal of Medical Ethics, 37(4): 249–253. DOI: https://doi.org/10.1136/jme.2010.040923
6. Allison, D B, Brown, A W, George, B J, and Kaiser, K A 2016 Reproducibility: A tragedy of errors. Nature, 530(7588): 27–29. DOI: https://doi.org/10.1038/530027a
7. Bastian, H 2016 Reproducibility Crisis Timeline: Milestones in Tackling Research Reliability. URL https://absolutelymaybe.plos.org/2016/12/05/reproducibility-crisis-timeline-milestones-in-tackling-research-reliability/. [Online; accessed 22-March-2023]
8. Whitfield, J 2021 Replication Crisis. London Review of Books, 43(19). URL https://www.lrb.co.uk/the-paper/v43/n19/john-whitfield/replication-crisis. [Online; accessed 22-March-2023]
9. Hernández, J A and Colom, M 2023 Repeatability, Reproducibility, Replicability, Reusability (4R) in Journals' Policies and Software/Data Management in Scientific Publications: A Survey, Discussion, and Perspectives. URL https://hal.science/hal-04322522. [Online; accessed 4-January-2024]
10. Desquilbet, L L, Granger, S, Hejblum, B, Legrand, A, Pernot, P, Rougier, N P, Castro Guerra, E de, Courbin-Coulaud, M, Duvaux, L, Gravier, P, Le Campion, G, Roux, S, and Santos, F 2019 Vers une recherche reproductible. Unité régionale de formation à l’information scientifique et technique de Bordeaux. URL https://hal.science/hal-02144142
11. Hejblum, B P, Kunzmann, K, Lavagnini, E, Hutchinson, A, Robertson, D, Jones, S, and Eckes-Shephard, A 2020 Realistic and Robust Reproducible Research for Biostatistics. DOI: https://doi.org/10.20944/preprints202006.0002.v1
12. The Turing Way Community 2022 The Turing Way: A handbook for reproducible, ethical and collaborative research. DOI: https://doi.org/10.5281/zenodo.7625728
13. Knuth, D E 1984 Literate programming. The Computer Journal, 27(2): 97–111.

Reproducibility considerations

[Figure: reproducibility workflow diagram with labels Scientific, External, Editorial, Start Experiment, Checkpoint, Data Transfer, End Experiment, Direct Reproducibility, Computational Experiment]

Scientific and editorial reproducibility

Two-fold reproducibility

The global scientific workflow of a reproducible process for a Computo article may be split into two types of steps:

External and Editorial

External
Process to obtain (intermediate) results outside of the notebook environment, for one or more of the following (non-exclusive) reasons:
  • the process is too long to be conducted in a notebook
  • the data to be processed is too big to be handled directly in the notebook
  • it needs a specific environment (e.g. a cluster, with gpus, etc.)
  • it needs to involve specific languages (e.g. C, C++, Fortran, etc.) or build tools (e.g. make, cmake, etc.)

Editorial
Notebook rendering with the results of the external process

Requirement

If the notebook contains everything to produce the final document

\(\Rightarrow\) “Direct reproducibility” in the sense that the notebook is the only thing needed to reproduce the results.

Ultimately, the workflow must end with a direct reproducibility step which concludes the whole process.

Reproducibility considerations (5)

Data transfer
When applicable, the switch from external to editorial reproducibility is done with a “data transfer” step,

data produced by the external process \(\Rightarrow\) transferred to the notebook environment.

Requirement

Not only must the intermediate results be provided, but also the code that transfers them into the notebook environment.

There are a variety of software solutions to do so.

Examples of data transfer solutions

Intermediate results storage

  • Python: joblib.Memory, a caching mechanism for Python functions that saves the result of a function call to disk and loads it back later.
  • R: the .RData file format, which can be loaded back in R with the load() function.
  • If results are small enough, storing in a text file (e.g. .csv, .tsv, .json, etc.) is also a solution.
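The text-file option can be sketched with the Python standard library alone. This is only an illustration under stated assumptions: the helper names `save_results` and `load_results` and the file path are hypothetical, and the results are assumed to be small and JSON-serializable (numbers, strings, lists, dictionaries).

```python
import json
from pathlib import Path


def save_results(results, path):
    """Called at the end of the external process: write results to a JSON file."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        # sort_keys gives a stable file content, which keeps git diffs small
        json.dump(results, f, indent=2, sort_keys=True)


def load_results(path):
    """Called from the notebook: read the intermediate results back."""
    with Path(path).open(encoding="utf-8") as f:
        return json.load(f)
```

In such a setup, the external script would call `save_results(...)` once, and the notebook would only call `load_results(...)`, so rendering the article never re-runs the expensive computation.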

Transfer of the results to the notebook environment

  • The intermediate results (.joblib directory or .RData file) can be committed to the git repository and loaded directly in the notebook environment.
  • Alternatively, centralize input data (when large enough) and intermediate results on a shared scientific data repository (we recommend Zenodo for this purpose), and download them in the notebook environment.
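The download option can be sketched as follows, using only the Python standard library. This is a sketch under assumptions, not the Computo procedure: the helper names are hypothetical, and the URL and checksum must be supplied by the author (for example, the direct file link of a Zenodo record and the SHA-256 of the deposited file). Verifying a checksum is an addition that guards against silently using a different version of the data.

```python
import hashlib
import shutil
import urllib.request
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fetch_if_missing(url, dest, expected_sha256):
    """Download `url` to `dest` unless already present, then verify its checksum."""
    dest = Path(dest)
    if not dest.exists():
        dest.parent.mkdir(parents=True, exist_ok=True)
        with urllib.request.urlopen(url) as response, dest.open("wb") as out:
            shutil.copyfileobj(response, out)
    actual = sha256_of(dest)
    if actual != expected_sha256:
        raise ValueError(f"checksum mismatch for {dest}: got {actual}")
    return dest
```

Calling `fetch_if_missing` in the first cell of the notebook keeps the data transfer step explicit and part of the versioned source.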

Workshop

Quarto

In this workshop, we will learn how to use quarto to create a document that includes code, data, and narrative text. We will also learn how to get the CI (continuous integration) working.

The main pipeline, step by step

  • Template installation
  • Computing environment: renv, conda, etc.
  • Authoring in the qmd
  • Rendering locally
  • Pushing to GitHub

Getting started

To get started you will need to clone the mock template for this workshop. The template is available at

https://github.com/computorg/template-jds2024


Creating a repo from a template

  1. On GitHub.com, navigate to the main page of the repository.
  2. Above the file list, click Use this template.
  3. Select Create a new repository.
  4. Select Include all branches.

Language version

Make a git clone of the repository you just created from the template and open it in your favorite IDE.

Python version

Rename the published-paper-tsne-python.qmd to published-paper-tsne.qmd

R version

Rename the published-paper-tsne-R.qmd to published-paper-tsne.qmd

Conclusion