Articles | COMPUTO

Published

2024

AdaptiveConformal: An R Package for Adaptive Conformal Inference

Herbert Susmann, Antoine Chambaz, and Julie Josse

Computo, 2024.

Abs HTML PDF git repo Review Bib

Conformal Inference (CI) is a popular approach for generating finite sample prediction intervals based on the output of any point prediction method when data are exchangeable. Adaptive Conformal Inference (ACI) algorithms extend CI to the case of sequentially observed data, such as time series, and exhibit strong theoretical guarantees without having to assume exchangeability of the observed data. The common thread that unites algorithms in the ACI family is that they adaptively adjust the width of the generated prediction intervals in response to the observed data. We provide a detailed description of five ACI algorithms and their theoretical guarantees, and test their performance in simulation studies. We then present a case study of producing prediction intervals for influenza incidence in the United States based on black-box point forecasts. Implementations of all the algorithms are released as an open-source R package, AdaptiveConformal, which also includes tools for visualizing and summarizing conformal prediction intervals.
Peerannot: classification for crowdsourced image datasets with Python

Tanguy Lefort, Benjamin Charlier, Alexis Joly, and Joseph Salmon

Computo, 2024.

Abs HTML PDF git repo Review Bib

Crowdsourcing is a quick and easy way to collect labels for large datasets, involving many workers. However, workers often disagree with each other. Sources of error can arise from the workers’ skills, but also from the intrinsic difficulty of the task. We present peerannot: a Python library for managing and learning from crowdsourced labels for classification. Our library allows users to aggregate labels from common noise models or train a deep learning-based classifier directly from crowdsourced labels. In addition, we provide an identification module to easily explore the task difficulty of datasets and worker capabilities.

crowdsourcing, label noise, task difficulty, worker ability, classification
Optimal projection for parametric importance sampling in high dimensions

Maxime El Masri, Jérôme Morio, and Florian Simatos

Computo, 2024.

Abs HTML PDF git repo Review Bib

In this paper we propose a dimension-reduction strategy in order to improve the performance of importance sampling in high dimension. The idea is to estimate variance terms in a small number of suitably chosen directions. We first prove that the optimal directions, i.e., the ones that minimize the Kullback–Leibler divergence with the optimal auxiliary density, are the eigenvectors associated to extreme (small or large) eigenvalues of the optimal covariance matrix. We then perform extensive numerical experiments that show that as dimension increases, these directions give estimations which are very close to optimal. Moreover, we show that the estimation remains accurate even when a simple empirical estimator of the covariance matrix is used to estimate these directions. These theoretical and numerical results open the way for different generalizations, in particular the incorporation of such ideas in adaptive importance sampling schemes

Rare event simulation, Parameter estimation, Importance sampling, Dimension reduction, Kullback–Leibler divergence, Projection
Point Process Discrimination According to Repulsion

Hamza Adrat, and Laurent Decreusefond

Computo, 2024.

Abs HTML PDF git repo Review Bib

In numerous applications, cloud of points do seem to exhibit repulsion in the intuitive sense that there is no local cluster as in a Poisson process. Motivated by data coming from cellular networks, we devise a classification algorithm based on the form of the Voronoi cells. We show that, in the particular set of data we are given, we can retrieve some repulsiveness between antennas, which was expected for engineering reasons.

classification, point process, repulsion
A hierarchical model to evaluate pest treatments from prevalence and intensity data

Armand Favrot, and David Makoswki

Computo, 2024.

Abs HTML PDF git repo Review Bib

In plant epidemiology, pest abundance is measured in field trials using metrics assessing either pest prevalence (fraction of the plant population infected) or pest intensity (average number of pest individuals present in infected plants). Some of these trials rely on prevalence, while others rely on intensity, depending on the protocols. In this paper, we present a hierarchical Bayesian model able to handle both types of data. In this model, the intensity and prevalence variables are derived from a latent variable representing the number of pest individuals on each host individual, assumed to follow a Poisson distribution. Effects of pest treaments, time trend, and between-trial variability are described using fixed and random effects. We apply the model to a real dataset in the context of aphid control in sugar beet fields. In this dataset, prevalence and intensity were derived from aphid counts observed on either factorial trials testing different types of pesticides treatments or field surveys monitoring aphid abundance. Next, we perform simulations to assess the impacts of using either prevalence or intensity data, or both types of data simultaneously, on the accuracy of the model parameter estimates and on the ranking of pesticide treatment efficacy. Our results show that, when pest prevalence and pest intensity data are collected separately in different trials, the model parameters are more accurately estimated using both types of trials than using one type of trials only. When prevalence data are collected in all trials and intensity data are collected in a subset of trials, estimations and pest treatment ranking are more accurate using both types of data than using prevalence data only. When only one type of observation can be collected in a pest survey or in an experimental trial, our analysis indicates that it is better to collect intensity data than prevalence data when all or most of the plants are expected to be infested, but that both types of data lead to similar results when the level of infestation is low to moderate. Finally, our simulations show that it is unlikely to obtain accurate results with fewer than 40 trials when assessing the efficacy of pest control treatments based on prevalence and intensity data. Because of its flexibility, our model can be used to evaluate and rank the efficacy of pest treatments using either prevalence or intensity data, or both types of data simultaneously. As it can be easily implemented using standard Bayesian packages, we hope that it will be useful to agronomists, plant pathologists, and applied statisticians to analyze pest surveys and field experiments conducted to assess the efficacy of pest treatments.

bayesian model, epidemiology, hierarchical model, pest control, trial, survey

2023

Local tree methods for classification: a review and some dead ends

Alice Cleynen, Louis Raynal, and Jean-Michel Marin

Computo, 2023.

Abs HTML PDF git repo Review Bib

Random Forests (RF) [@breiman:2001] are very popular machine learning methods. They perform well even with little or no tuning, and have some theoretical guarantees, especially for sparse problems [@biau:2012;@scornet:etal:2015]. These learning strategies have been used in several contexts, also outside the field of classification and regression. To perform Bayesian model selection in the case of intractable likelihoods, the ABC Random Forests (ABC-RF) strategy of @pudlo:etal:2016 consists in applying Random Forests on training sets composed of simulations coming from the Bayesian generative models. The ABC-RF technique is based on an underlying RF for which the training and prediction phases are separated. The training phase does not take into account the data to be predicted. This seems to be suboptimal as in the ABC framework only one observation is of interest for the prediction. In this paper, we study tree-based methods that are built to predict a specific instance in a classification setting. This type of methods falls within the scope of local (lazy/instance-based/case specific) classification learning. We review some existing strategies and propose two new ones. The first consists in modifying the tree splitting rule by using kernels, the second in using a first RF to compute some local variable importance that is used to train a second, more local, RF. Unfortunately, these approaches, although interesting, do not provide conclusive results.

classification, Random Forests, local methods
Computing an empirical Fisher information matrix estimate in latent variable models through stochastic approximation

Maud Delattre, and Estelle Kuhn

Computo, 2023.

Abs HTML PDF git repo Review Bib

The Fisher information matrix (FIM) is a key quantity in statistics. However its exact computation is often not trivial. In particular in many latent variable models, it is intricated due to the presence of unobserved variables. Several methods have been proposed to approximate the FIM when it can not be evaluated analytically. Different estimates have been considered, in particular moment estimates. However some of them require to compute second derivatives of the complete data log-likelihood which leads to some disadvantages. In this paper, we focus on the empirical Fisher information matrix defined as an empirical estimate of the covariance matrix of the score, which only requires to compute the first derivatives of the log-likelihood. Our contribution consists in presenting a new numerical method to evaluate this empirical Fisher information matrix in latent variable model when the proposed estimate can not be directly analytically evaluated. We propose a stochastic approximation estimation algorithm to compute this estimate as a by-product of the parameter estimate. We evaluate the finite sample size properties of the proposed estimate and the convergence properties of the estimation algorithm through simulation studies.

Model-based standard error, moment estimate, Fisher identity, stochastic approximation algorithm
Inference of Multiscale Gaussian Graphical Model

Edmond Sanou, Christophe Ambroise, and Geneviève Robin

Computo, 2023.

Abs HTML PDF git repo Review Bib

Gaussian Graphical Models (GGMs) are widely used in high-dimensional data analysis to synthesize the interaction between variables. In many applications, such as genomics or image analysis, graphical models rely on sparsity and clustering to reduce dimensionality and improve performances. This paper explores a slightly different paradigm where clustering is not knowledge-driven but performed simultaneously with the graph inference task. We introduce a novel Multiscale Graphical Lasso (MGLasso) to improve networks interpretability by proposing graphs at different granularity levels. The method estimates clusters through a convex clustering approach — a relaxation of k-means, and hierarchical clustering. The conditional independence graph is simultaneously inferred through a neighborhood selection scheme for undirected graphical models. MGLasso extends and generalizes the sparse group fused lasso problem to undirected graphical models. We use continuation with Nesterov smoothing in a shrinkage-thresholding algorithm (CONESTA) to propose a regularization path of solutions along the group fused Lasso penalty, while the Lasso penalty is kept constant. Extensive experiments on synthetic data compare the performances of our model to state-of-the-art clustering methods and network inference models. Applications to gut microbiome data and poplar’s methylation mixed with transcriptomic data are presented.

Neighborhood selection, Convex hierarchical clustering, Gaussian graphical models
Macrolitter Video Counting on Riverbanks Using State Space Models and Moving Cameras

Mathis Chagneux, Sylvain Le Corff, Pierre Gloaguen, Charles Ollion, Océane Lepâtre, and Antoine Bruge

Computo, 2023.

Abs HTML PDF git repo Review Bib

Litter is a known cause of degradation in marine environments and most of it travels in rivers before reaching the oceans. In this paper, we present a novel algorithm to assist waste monitoring along watercourses. While several attempts have been made to quantify litter using neural object detection in photographs of floating items, we tackle the more challenging task of counting directly in videos using boat-embedded cameras. We rely on multi-object tracking (MOT) but focus on the key pitfalls of false and redundant counts which arise in typical scenarios of poor detection performance. Our system only requires supervision at the image level and performs Bayesian filtering via a state space model based on optical flow. We present a new open image dataset gathered through a crowdsourced campaign and used to train a center-based anchor-free object detector. Realistic video footage assembled by water monitoring experts is annotated and provided for evaluation. Improvements in count quality are demonstrated against systems built from state-of-the-art multi-object trackers sharing the same detection capabilities. A precise error decomposition allows clear analysis and highlights the remaining challenges.
A Python Package for Sampling from Copulae: clayton

Alexis Boulin

Computo, 2023.

Abs HTML PDF git repo Review Bib

The package clayton is designed to be intuitive, user-friendly, and efficient. It offers a wide range of copula models, including Archimedean, Elliptical, and Extreme. The package is implemented in pure Python, making it easy to install and use. In addition, we provide detailed documentation and examples to help users get started quickly. We also conduct a performance comparison with existing R packages, demonstrating the efficiency of our implementation. The clayton package is a valuable tool for researchers and practitioners working with copulae in Python

Copulae, Random number generation

2022

Trade-off between deep learning for species identification and inference about predator-prey co-occurrence: Reproducible R workflow integrating models in computer vision and ecological statistics

Olivier Gimenez, Maelis Kervellec, Jean-Baptiste Fanjul, Anna Chaine, Lucile Marescot, Yoann Bollet, and Christophe Duchamp

Computo, 2022.

Abs HTML PDF git repo Review Bib

Deep learning is used in computer vision problems with important applications in several scientific fields. In ecology for example, there is a growing interest in deep learning for automatizing repetitive analyses on large amounts of images, such as animal species identification. However, there are challenging issues toward the wide adoption of deep learning by the community of ecologists. First, there is a programming barrier as most algorithms are written in Python while most ecologists are versed in R. Second, recent applications of deep learning in ecology have focused on computational aspects and simple tasks without addressing the underlying ecological questions or carrying out the statistical data analysis to answer these questions. Here, we showcase a reproducible R workflow integrating both deep learning and statistical models using predator-prey relationships as a case study. We illustrate deep learning for the identification of animal species on images collected with camera traps, and quantify spatial co-occurrence using multispecies occupancy models. Despite average model classification performances, ecological inference was similar whether we analysed the ground truth dataset or the classified dataset. This result calls for further work on the trade-offs between time and resources allocated to train models with deep learning and our ability to properly address key ecological questions with biodiversity monitoring. We hope that our reproducible workflow will be useful to ecologists and applied statisticians.

computer vision, deep-learning, species distribution modeling, ecological statistics

In the pipeline

Manuscript conditionally accepted, whose editorial and scientific reproducibility is being validated

Geometric-Based Pruning Rules for Change Point Detection in Multiple Independent Time Series

Liudmila Pishchagina, Guillem Rigaill, and Vincent Runge

Computo, 2024.

Abs HTML PDF git repo Review Bib

We address the challenge of identifying multiple change points in a group of independent time series, assuming these change points occur simultaneously in all series and their number is unknown. The search for the best segmentation can be expressed as a minimization problem over a given cost function. We focus on dynamic programming algorithms that solve this problem exactly. When the number of changes is proportional to data length, an inequality-based pruning rule encoded in the PELT algorithm leads to a linear time complexity. Another type of pruning, called functional pruning, gives a close-to-linear time complexity whatever the number of changes, but only for the analysis of univariate time series. We propose a few extensions of functional pruning for multiple independent time series based on the use of simple geometric shapes (balls and hyperrectangles). We focus on the Gaussian case, but some of our rules can be easily extended to the exponential family. In a simulation study we compare the computational efficiency of different geometric-based pruning rules. We show that for a small number of time series some of them ran significantly faster than inequality-based approaches in particular when the underlying number of changes is small compared to the data length.
Bayesian Spatiotemporal Modelling of Wildfire Occurrences and Sizes for Projections Under Climate Change

Juliette Legrand, François Pimont, Jean-Luc Dupuy, and Thomas Opitz

Computo, 2024.

Abs HTML PDF git repo Review Bib

Appropriate spatiotemporal modelling of wildfire activity is crucial for its prediction and risk management. Here, we focus on wildfire risk in the Aquitaine region in the Southwest of France and its projection under climate change. We study whether wildfire risk could further increase under climate change in this specific region, which does not lie in the historical core area of wildfires in Southeastern France, corresponding to the Southwest. For this purpose, we consider a marked spatiotemporal point process, a flexible model for occurrences and magnitudes of such environmental risks, where the magnitudes are defined as the burnt areas. The model is first calibrated using 14 years of past observation data of wildfire occurrences and weather variables, and then applied for projection of climate-change impacts using simulations of numerical climate models until 2100 as new inputs. We work within the framework of a spatiotemporal Bayesian hierarchical model, and we present the workflow of its implementation for a large dataset at daily resolution for 8km-pixels using the INLA-SPDE approach. The assessment of the posterior distributions shows a satisfactory fit of the model for the observation period. We stochastically simulate projections of future wildfire activity by combining climate model output with posterior simulations of model parameters. Depending on climate models, spline-smoothed projections indicate low to moderate increase of wildfire activity under climate change. The increase is weaker than in the historical core area, which we attribute to different weather conditions (oceanic versus Mediterranean). Besides providing a relevant case study of environmental risk modelling, this paper is also intended to provide a full workflow for implementing the Bayesian estimation of marked log-Gaussian Cox processes using the R-INLA package of the R statistical software.

Example: a mock contribution

This page is a reworking of the original t-SNE article using the Computo template. It aims to help authors submitting to the journal by using some advanced formatting features.

Visualizing Data using t-SNE: practical Computo example

Laurens Maaten, and Geoffrey Hinton

Computo, 2021.

Abs HTML PDF git repo Review Bib

We present a new technique called “t-SNE” that visualizes high-dimensional data by giving each datapoint a location in a two or three-dimensional map. The technique is a variation of Stochastic Neighbor Embedding hinton:stochastic that is much easier to optimize, and produces significantly better visualizations by reducing the tendency to crowd points together in the center of the map. t-SNE is better than existing techniques at creating a single map that reveals structure at many different scales. This is particularly important for high-dimensional data that lie on several different, but related, low-dimensional manifolds, such as images of objects from multiple classes seen from multiple viewpoints. For visualizing the structure of very large data sets, we show how t-SNE can use random walks on neighborhood graphs to allow the implicit structure of all the data to influence the way in which a subset of the data is displayed. We illustrate the performance of t-SNE on a wide variety of data sets and compare it with many other non-parametric visualization techniques, including Sammon mapping, Isomap, and Locally Linear Embedding. The visualization produced by t-SNE are significantly better than those produced by other techniques on almost all of the data sets.

template, documentation, quarto, R, python