An academic journal in statistics and machine learning promoting reproducibility and an alternative publication mode
November 5, 2024
Editorial board
IT support
Statistical learning, DR INRAE, Paris-Saclay University
Statistics, DR CNRS, IMT Toulouse
Optimization/Machine learning, CR INRIA Lyon
CS/Stats/ML, IR CNRS, IMAG, Montpellier University
Machine learning, CR CNRS, Grenoble Alpes University
Statistics, MCF, Institut Agro Rennes-Angers
Machine learning, CR MinesParisTech
Stats/ML/dev, IR CNRS, LBMC, ENS Lyon
Fundamentally, it provides three things:
Tools to reproduce the results (that’s like cooking)
A “recipe” to reproduce the results (still like cooking)
A path to understanding the results and the process that led to them (unlike cooking…)
The PDF era and paper submission.
Reproducibility was not a priority.
And then in the Machine Learning domain, there was distill.pub [1]
but…
… the engineering was too complex for the average scientist (a lot of JavaScript, etc.)
In fact, the distill.pub project was discontinued in 2021 [2]
distill.pub’s goals were right, but the project overreached itself in terms of development complexity.
French statistical society (SFdS)
Assessment
😔 Multiplication of “traditional” journals…
😔 No recognition of “negative” results
😥 Little or no recognition of source code and case studies
😱 Decline in publication quality and in the time dedicated to each article (on both the author and reviewer sides) [3]
😱 Issues with scientific reproducibility (analyses, experiments) [4–9]
Point of view
Scientific scope
Promote contributions in statistics and machine learning that provide insight into which models or methods are most appropriate to address a specific scientific question
Open access
⇝ In accordance with Budapest Open Access Initiative (BOAI) and Plan S
Reproducible
Official launch at the end of 2021
Notebook and literate programming
text (markdown) + math (\(\LaTeX\)) + code (Python/R/Julia) + references (bib\(\TeX\))
Environment management, Compilation, Multi-format publication (pdf, html)
Continuous integration/Continuous deployment (CI/CD)
[Figure: the markdown/Rmarkdown to output conversion workflow via Pandoc. Credit: Pratik89Roy, CC-BY-SA-4.0, from Wikimedia]
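To make these ingredients concrete, here is a minimal sketch of what such a notebook source could look like as a Quarto (.qmd) file with a Python chunk; the title, author, bibliography entry and computation are hypothetical placeholders, not taken from an actual Computo article.

````markdown
---
title: "A minimal example"          # hypothetical metadata
author: "Jane Doe"
format:
  html: default
  pdf: default
bibliography: references.bib        # assumed BibTeX file next to the notebook
jupyter: python3
---

## Introduction

We estimate the mean $\mu$ of a Gaussian sample $x_1, \dots, x_n$ [@doe2024].

```{python}
#| label: fig-sample
#| fig-cap: "Histogram of the simulated sample."
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)                 # fixed seed for reproducibility
x = rng.normal(loc=1.0, scale=2.0, size=1_000)  # simulated data
print(f"Empirical mean: {x.mean():.3f}")

plt.hist(x, bins=30)
plt.show()
```
````

Rendering this file with quarto render executes the code and produces both the HTML and PDF outputs declared in the header.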
A template repository is provided, with a template notebook document, documentation, and a pre-configured compilation and publication setup.
Let’s go, locally (same spirit as Jupyter/Rmarkdown notebooks)
Configuration file versioned and used during the CI compilation/publication action
A git push command will trigger your article compilation (including computations) and publication as a GitHub page.
See the preconfigured .github/workflows/build.yml file for the GitHub Action configuration.
If the CI process succeeds, both HTML and PDF versions are published on the GitHub page associated with the repository
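As a point of comparison, a minimal GitHub Actions workflow in this spirit might look like the sketch below. This is not Computo's actual build.yml but an assumed setup based on the publicly available quarto-dev/quarto-actions actions; the branch name, Python version and requirements.txt dependency file are illustrative.

```yaml
# Illustrative sketch of a .github/workflows/build.yml (not the journal's actual file)
name: build-and-publish

on:
  push:
    branches: [main]

jobs:
  build:
    runs-on: ubuntu-latest
    permissions:
      contents: write                         # needed to push the rendered site to gh-pages
    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"

      - name: Install Python dependencies
        run: pip install -r requirements.txt  # assumed dependency file

      - name: Set up Quarto
        uses: quarto-dev/quarto-actions/setup@v2

      - name: Render and publish to GitHub Pages
        uses: quarto-dev/quarto-actions/publish@v2
        with:
          target: gh-pages                    # publishes the rendered HTML (and PDF) output
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
```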
https://openreview.net/group?id=Computo
Submit:
After a “traditional” review process, a 3-step procedure follows, including…
🥲 Fully operational, with DOI and ISSN
🙂 13 published articles, 5 under review (more details here)
🙂 x presentations (Montpellier, Toronto, Humastica, Grenoble, RR2023, etc.)
🙂 French reproducible research network
🤯 Difficult to find reviewers
🤔 Institutional support?
🤔 Changing of practices in the scientific community?
quarto: dynamic, language-agnostic, FOSS, community-based (pandoc), RStudio/Posit support
github: dynamic, large user community, but not institutional and with limited computing resources
Reproducibility considerations
The global scientific workflow of a reproducible process for a Computo article may be split into two types of steps:
Processes run outside of the notebook environment to obtain (intermediate) results, for a number of reasons (non-exclusive to each other)
Notebook rendering with the results of the external process
Requirement
If the notebook contains everything needed to produce the final document
\(\Rightarrow\) “direct reproducibility”, in the sense that the notebook is the only thing needed to reproduce the results.
Ultimately, the workflow must end with a direct-reproducibility step that concludes the whole process.
Data produced by the external process \(\Rightarrow\) transferred to the notebook environment.
Requirement
Not only must the intermediate results be provided, but also the code to transfer them into the notebook environment.
There are a variety of software solutions to do so.
joblib.Memory, a caching mechanism for Python functions: it saves the results of a function call to disk and loads them back later (see the Python sketch after this list).
The .RData file format: results saved from R can be loaded back with the load() function.
Plain-text formats (.csv, .tsv, .json, etc.) are also a solution.
The resulting files (e.g. a .joblib cache directory or an .RData file) can be committed to the git repository and loaded directly in the notebook environment.
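To illustrate the joblib.Memory option, here is a minimal Python sketch; the long_simulation function and the .joblib_cache directory name are hypothetical, but the caching pattern is joblib's standard usage.

```python
from joblib import Memory
import numpy as np

# Cache directory: its contents can be committed to the git repository so that
# the CI run reloads the stored results instead of recomputing them.
memory = Memory("./.joblib_cache", verbose=1)

@memory.cache
def long_simulation(n_samples: int, seed: int = 0) -> np.ndarray:
    """Stand-in for an expensive computation run outside the notebook."""
    rng = np.random.default_rng(seed)
    return rng.normal(size=(n_samples, 10)).mean(axis=0)

# First call: runs the computation and stores the result on disk.
# Later calls with the same arguments (e.g. during notebook rendering in CI)
# load the result back from the cache.
result = long_simulation(1_000_000)
print(result)
```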